# Advanced Data Parsing Modules and Packages

Data parsing and manipulation are cornerstones of Python's usefulness to the data scientist. For nearly every file format, there are typically one or more Python libraries that can parse the file into in-memory data structures.

We will look at a few examples, and you will continue this practical exploration in the end-of-day learning exercises.

 
## Pandas ... again  ![Panda Sleeping](../images/panda-sleep-2.jpg)

The pandas library is very useful for working with tabular data. Please read through the overview of its data structures:
https://pandas.pydata.org/pandas-docs/stable/overview.html#data-structures

As you have seen before, we will import some libraries, specifically NumPy and pandas, and this time we alias them as `np` and `pd`, respectively.

**Example: loading a file and examining the column headers**

```
import numpy as np
import pandas as pd
# Here we create a dataframe
df = pd.read_csv('/dsa/data/all_datasets/SyriaIDPSites2015LateJunHIUDoS.csv', encoding='latin-1')
# Show the Column Headings
print(df.columns)
```

```
Index(['Description', 'Country', 'ADM1', 'ADM2', 'ADM3', 'ADM4', 'Latitude',
       'Longitude', 'Name', 'pcode', 'fips', 'iso_alpha2', 'iso_alpha3',
       'iso_num', 'stanag', 'tld'],
      dtype='object')
```

## JavaScript Object Notation (JSON)

The library that we previously saw, json, is capable of both reading and writing JSON.

```
import json

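# Use a context manager so the file is closed once parsing is done
with open('/dsa/data/all_datasets/Syria_IDPSites_2015LateJun_HIU_DoS.geojson', encoding='latin-1') as geojson_file:
    file_data = json.load(geojson_file)
# Show the first few hundred characters of the parsed structure
print(str(file_data)[1:300], " ...")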
```

```
'crs': {'properties': {'name': 'urn:ogc:def:crs:EPSG::4326'}, 'type': 'name'}, 'totalFeatures': 52, 'type': 'FeatureCollection', 'features': [{'geometry': {'coordinates': [36.447, 32.588], 'type': 'Point'}, 'properties': {'iso_alpha2': 'SY', 'stanag': 'SYR', 'iso_alpha3': 'SYR', 'tld': '.sy', 'Name  ...
```
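
The example above only reads JSON. As a minimal sketch of the writing direction, the block below serializes a small dictionary and parses it back. The `collection` data is a made-up miniature that only imitates the shape of the GeoJSON output shown above; the site names and the second coordinate pair are hypothetical.

```python
import json

# Hypothetical miniature of the GeoJSON structure printed above;
# the site names and the second coordinate pair are made up.
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [36.447, 32.588]},
            "properties": {"Name": "Site A", "iso_alpha2": "SY"},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [37.1, 36.2]},
            "properties": {"Name": "Site B", "iso_alpha2": "SY"},
        },
    ],
}

# Writing: serialize the Python objects to a JSON string.
# json.dump(collection, file_object) would write to an open file instead.
text = json.dumps(collection, indent=2)

# Reading: parse the JSON text back into dicts and lists.
restored = json.loads(text)

# Walk the parsed structure like any other nested dict/list.
site_names = [f["properties"]["Name"] for f in restored["features"]]
for feature in restored["features"]:
    # GeoJSON stores coordinates as [longitude, latitude]
    lon, lat = feature["geometry"]["coordinates"]
    print(feature["properties"]["Name"], lat, lon)
```

`json.dumps` and `json.loads` work on strings, while `json.dump` and `json.load` work on open file objects.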


See https://docs.python.org/3/library/json.html for additional information.

## Comma Separated Values (CSV)

**Example: loading a CSV file, then writing a JSON file**

```
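import csv
import json

# Define the field names.  Look into the csv library documentation for details
fieldnames = ('Description', 'Country', 'ADM1', 'ADM2', 'ADM3', 'ADM4', 'Latitude', 'Longitude', 'Name', 'pcode', 'fips', 'iso_alpha2', 'iso_alpha3', 'iso_num', 'stanag', 'tld')

# Open the input file for reading and the output file in (w)rite mode;
# the with block closes both files automatically
with open('SyriaIDPSites2015LateJunHIUDoS.csv', encoding='latin-1', newline='') as csvfile, \
        open('MyOutput.json', 'w') as jsonfile:
    # Read the file data using the dictionary reader
    file_data = csv.DictReader(csvfile, fieldnames)
    # Because fieldnames is given explicitly, the file's header line is
    # returned as an ordinary row; skip it so it is not written as data
    next(file_data, None)
    # Now transform: write one JSON object per line
    for data_row in file_data:
        json.dump(data_row, jsonfile)
        jsonfile.write('\n')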
```
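
As a variation on the example above, here is a minimal sketch that lets `csv.DictReader` take the field names from the header line and then writes all rows as one JSON array. The in-memory `sample_csv` text and its values are hypothetical stand-ins for a real file.

```python
import csv
import io
import json

# Hypothetical two-row sample with a header line; the values are made up.
sample_csv = io.StringIO(
    "Name,Country,Latitude,Longitude\n"
    "Site A,Syria,32.588,36.447\n"
    "Site B,Syria,36.2,37.1\n"
)

# With no fieldnames argument, DictReader uses the first row as the keys.
reader = csv.DictReader(sample_csv)
rows = list(reader)

# Every value is read as a string; convert numeric columns explicitly.
for row in rows:
    row["Latitude"] = float(row["Latitude"])
    row["Longitude"] = float(row["Longitude"])

# Serialize all rows as a single JSON array.
json_text = json.dumps(rows, indent=2)
print(json_text)
```

Writing one array produces a single valid JSON document, whereas writing one object per line (as in the example above) produces a file that has to be parsed line by line.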
 

See https://docs.python.org/3/library/csv.html for additional information.



## BeautifulSoup

Beautiful Soup is a great HTML/XML parser. 

This example parses a KML file, which is an XML-based format. HTML and XML files hold hierarchical data, but they are still text-based files.

```
# Beautiful Soup
from bs4 import BeautifulSoup
 
# Open the file as a file object
# Parse the file object into an XML structured document, AKA the KML Document Object Model (DOM)
# Note: the 'xml' parser requires the lxml package to be installed
with open('geonode-Syria_IDPSites_2015LateJun_HIU_DoS.kml') as kml_file_object:
    dom = BeautifulSoup(kml_file_object, 'xml')

# Print the text of every <name> element inside each <Placemark>
for pm in dom.find_all('Placemark'):
    for name in pm.find_all('name'):
        nametext = name.find(string=True)
        print(nametext)
```

Please review the Beautiful Soup documentation for much more information: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

 

 
## General Advanced Parsing

In general, for any commonly known file format, someone has probably already written Python code to handle it. Search the internet for "python" together with the name of the file type.
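
Sometimes that code already ships with Python. As one illustration, the sketch below reads KML with the standard library's `xml.etree.ElementTree` rather than Beautiful Soup. This is an alternative technique, not the approach used in the Beautiful Soup section, and the `kml_text` snippet with its placemark names and coordinates is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical KML snippet; the placemark names and coordinates are made up.
kml_text = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Site A</name>
      <Point><coordinates>36.447,32.588</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Site B</name>
      <Point><coordinates>37.1,36.2</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

# Parse the text into an element tree and get its root element.
root = ET.fromstring(kml_text)

# KML elements live in an XML namespace, so searches must name it.
ns = {"kml": "http://www.opengis.net/kml/2.2"}

placemark_names = []
for placemark in root.iterfind(".//kml:Placemark", ns):
    # findtext returns the text of the first matching child element
    name = placemark.findtext("kml:name", default="", namespaces=ns)
    placemark_names.append(name)
    print(name)
```

Unlike Beautiful Soup's `'xml'` mode, ElementTree requires the namespace to be spelled out in every search, which is why the `ns` mapping is passed along.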

Once you have parsed the data into memory, you simply need to iterate through it (recall Think Python, Chapter 7).

The data may already be in pandas or can be moved into pandas; the choice is often yours and is just a couple of lines of code away. When you work through the Database and SQL course, you will see this concept applied to MS Excel spreadsheet files.
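
As a rough sketch of what "a couple of lines" can look like, the block below moves already-parsed records into a DataFrame and back out again. The `records` list is hypothetical, shaped like the rows a CSV or JSON parser might return.

```python
import pandas as pd

# Hypothetical records, shaped like the rows a CSV or JSON parser might return.
records = [
    {"Name": "Site A", "Country": "Syria", "Latitude": 32.588, "Longitude": 36.447},
    {"Name": "Site B", "Country": "Syria", "Latitude": 36.2, "Longitude": 37.1},
]

# A list of dicts becomes a DataFrame with one column per key.
df = pd.DataFrame(records)
print(df.columns)
print(df.shape)

# Going the other way: back to a list of plain dictionaries.
round_trip = df.to_dict(orient="records")
```

Once the data is in a DataFrame, the column-oriented tools from the pandas section apply; `to_dict(orient="records")` returns to row-by-row iteration when that is more convenient.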



---