# Data sources. Other formats. 

* Parquet
* JSON
* Shapefile
* PC-Axis
* ...

Pandas documentation: https://pandas.pydata.org/docs/index.html

Pandas API reference: https://pandas.pydata.org/docs/reference/index.html#api

In [None]:
# import pandas library
import pandas as pd

### Parquet

Parquet is a compressed binary format that stores data column by column (columnar layout) instead of the usual row-by-row layout.

It's especially suitable for datasets with a large number of columns.
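Because the data is stored by column, a reader can load only the columns it needs. A minimal sketch of that idea, assuming a Parquet engine such as pyarrow or fastparquet is installed; the helper name `read_parquet_columns` is hypothetical, while `columns` is a real parameter of `pd.read_parquet`:

```python
import pandas as pd


def read_parquet_columns(path, columns):
    """Load only the named columns from a Parquet file.

    Hypothetical helper: it only forwards to pd.read_parquet, which
    needs a Parquet engine (pyarrow or fastparquet) to be installed.
    """
    return pd.read_parquet(path, columns=columns)
```

For example, `read_parquet_columns('data.parquet', ['col_a', 'col_b'])` would return a DataFrame with just those two columns (the file and column names here are placeholders).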

### Social vulnerability Dataset

Source: Data.gov (US Open Data Project)

https://catalog.data.gov/dataset/social-vulnerability-index-2018-united-states-tract 


In [None]:
# CSV file larger than 200 MB
df = pd.read_csv('/huge/datasets/Social_Vulnerability_Index_2018_-_United_States__tract.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.to_parquet('/huge/datasets/Social_Vulnerability_Index_2018_-_United_States__tract.parquet')

In [None]:
# Parquet is a binary format (don't try to open the file with a text editor)
# Note the compression ratio of the Parquet format: compare the file sizes.
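The file sizes can also be compared from Python instead of a file browser. A small sketch using only the standard library; the helper names `file_size_mb` and `size_ratio` are hypothetical:

```python
import os


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def size_ratio(csv_path, parquet_path):
    """Return how many times larger the CSV file is than the Parquet file."""
    return os.path.getsize(csv_path) / os.path.getsize(parquet_path)
```

Calling `size_ratio(...)` with the CSV and Parquet paths used above would give the compression ratio for this dataset.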

In [None]:
df_pq = pd.read_parquet('/huge/datasets/Social_Vulnerability_Index_2018_-_United_States__tract.parquet')
df_pq.head()                        

In [None]:
# Compare the time needed to read the dataset from CSV and from Parquet.
# Do you notice any difference?

In [None]:
# Use the %time IPython magic to measure the execution time
%time df = pd.read_csv('/huge/datasets/Social_Vulnerability_Index_2018_-_United_States__tract.csv')

In [None]:
# Use the %time IPython magic to measure the execution time
%time df_pq = pd.read_parquet('/huge/datasets/Social_Vulnerability_Index_2018_-_United_States__tract.parquet')
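`%time` only works inside IPython or Jupyter. In a plain Python script, a similar measurement could be sketched with `time.perf_counter`; the helper name `timed` is hypothetical:

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For instance, `df, seconds = timed(pd.read_csv, csv_path)` would return both the DataFrame and the wall-clock time spent reading it, where `csv_path` stands for the CSV path used above.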


In [None]:
df_pq.sample(10)

### JSON

JSON (JavaScript Object Notation) is a text-based format whose syntax is derived from JavaScript object literals.
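To see what that text looks like, here is a small sketch that serializes a DataFrame to a JSON string and reads it back. The two rows are made-up values, and `orient="records"` (one JSON object per row) is just one of the layouts pandas supports:

```python
import io

import pandas as pd

# Made-up sample values, used only for illustration
df_small = pd.DataFrame({"Duration": [60, 45], "Pulse": [110, 117]})

# Serialize to a JSON string: a list with one object per row
json_text = df_small.to_json(orient="records")
print(json_text)

# Parse the JSON text back into a DataFrame
restored = pd.read_json(io.StringIO(json_text), orient="records")
```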

In [None]:
# Sample JSON file with data Series
# https://www.w3schools.com/python/pandas/data.js
df = pd.read_json('../datasets/data_series.json')

In [None]:
df

In [None]:
# Show the dataframe in text mode
df.to_string()
#print(df.to_string())

In [None]:
# Select Pulse and Calories with Duration 60
df60 = df[df.Duration == 60][['Pulse','Calories']]
df60

In [None]:
# Save results in a new JSON file
df60.to_json('output/data_60.json')

In [None]:
# Open the new file with a text editor

In [None]:
# Sample JSON file with data in one level only, no nested data
# https://github.com/ankitgoel1602/data-science/tree/master/json-data
df = pd.read_json('../datasets/level_1.json')

In [None]:
df

In [None]:
# Complex JSON files
# https://towardsdatascience.com/how-to-parse-json-data-with-python-pandas-f84fbd0b1025
# pd.read_json alone is not enough for JSON files with nested data.
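One common option for nested data is `pd.json_normalize`, which flattens nested objects into columns and can expand nested lists into rows. A minimal sketch; the records below are made up for illustration and are not taken from the files used in this notebook:

```python
import pandas as pd

# Made-up nested records, used only for illustration
records = [
    {
        "school": "School A",
        "info": {"city": "Lugo", "founded": 1990},
        "students": [{"name": "Ana", "grade": 9}, {"name": "Xoel", "grade": 7}],
    },
    {
        "school": "School B",
        "info": {"city": "Pontevedra", "founded": 2005},
        "students": [{"name": "Iria", "grade": 8}],
    },
]

# Nested dicts become dotted column names such as "info.city"
flat = pd.json_normalize(records)

# One row per student, keeping selected parent fields as extra columns
students = pd.json_normalize(
    records,
    record_path="students",
    meta=["school", ["info", "city"]],
)
```

Here `flat` has one row per school, while `students` has one row per student with the `school` and `info.city` values repeated alongside.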

In [None]:
# Load a remote JSON dataset
# Dataset with cities of the world
url = 'https://raw.githubusercontent.com/lutangar/cities.json/master/cities.json'
df = pd.read_json(url)
df

In [None]:
# Search city: Lugo
df[df.name == 'Lugo']

In [None]:
# Search city: Coruña
df[df.name.str.contains('Coruña')]

In [None]:
# Search many elements
df[df.name.isin(['Lugo','Pontevedra'])]

In [None]:
# We can explore the dataset and ask questions:

In [None]:
# Question: how many cities are there in Spain?
df[df.country == 'ES'].name.count()

In [None]:
# and France?
df[df.country == 'FR'].name.count()

In [None]:
# Difference between the number of cities in France and Spain
df[df.country == 'FR'].name.count() - df[df.country == 'ES'].name.count()
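The same comparison can be sketched with `value_counts`, which counts the rows for every country in a single pass. The tiny DataFrame below is a made-up stand-in for the cities dataset, with the same `name` and `country` columns:

```python
import pandas as pd

# Made-up stand-in for the cities dataset
cities = pd.DataFrame(
    {
        "name": ["Lugo", "Pontevedra", "Paris", "Lyon", "Nice"],
        "country": ["ES", "ES", "FR", "FR", "FR"],
    }
)

# Number of rows per country code
counts = cities["country"].value_counts()

# Difference between the two countries, read from the same Series
difference = counts["FR"] - counts["ES"]
```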

### Shapefile

Shapefile is a format designed to store vector graphic (geometry) information together with related attribute data.

The format was popularized by geographical data processing applications such as QGIS and ArcGIS.

A shapefile is really a collection of files with several extensions. The three basic ones are .shp, .shx and .dbf, and they must be in the same directory.

More info: https://mxd.codes/articles/what-is-a-shapefile-shp-dbf-and-shx
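Since the three basic files must sit side by side, it can help to check for them before loading. A small sketch using only the standard library; the helper name `missing_shapefile_parts` is hypothetical:

```python
from pathlib import Path

# The three basic files that make up a shapefile
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the required extensions with no file next to `shp_path`."""
    base = Path(shp_path)
    return [ext for ext in REQUIRED_EXTENSIONS if not base.with_suffix(ext).exists()]
```

An empty list means the .shp, .shx and .dbf files are all present in the same directory.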

In [None]:
# Install geopandas first if it is not already available
!conda install geopandas -y

In [None]:
# Import geopandas library
import geopandas as gpd

### Galician municipalities 

Source: Sergas.es // GIS: Cartography of Galicia in SHP vector format for Geographic Information Systems

https://www.sergas.es/Saude-publica/GIS-Concellos


In [None]:
# Load shapefile into a GeoDataFrame
gdf = gpd.read_file('../datasets/Concellos/Concellos_IGN.shp')

In [None]:
gdf

In [None]:
# GeoPandas returns a GeoDataFrame, a DataFrame with a geometry column
type(gdf)

In [None]:
# Draw the DataFrame
gdf.plot()

In [None]:
# Draw only a province: Pontevedra
gdf[gdf.Provincia == 'Pontevedra'].plot()

### PC-Axis

PC-Axis is a specialized format designed for statistical software. It is widely used by public administrations.

The original applications come from Sweden: PXwin, PxWeb and PXEdit.

Files in this format use the .px extension.

In [None]:
# The library is not available in the Conda repos, so install it with pip.
# https://pypi.org/project/pyaxis/
!pip install pyaxis

In [None]:
#import pyaxis library
from pyaxis import pyaxis 

### Criminality in Spain

Source: Ministerio del Interior (Spanish Ministry of the Interior)

https://estadisticasdecriminalidad.ses.mir.es/publico/portalestadistico/balances.html


In [None]:
# Load file
px = pyaxis.parse('../datasets/criminalidade_2021_t3.px', encoding='ISO-8859-2')

In [None]:
px

In [None]:
px['DATA']

In [None]:
px['METADATA']

### Compressed files

Pandas can read compressed CSV files directly, in several formats: .zip, .tar.xz, etc.

In [None]:
df = pd.read_csv('../datasets/La_Liga_Winners.tar.xz')
df.head()

In [None]:
df = pd.read_csv('../datasets/La_Liga_Winners.zip')
df.head()

In [None]:
# Save the large number of Liga victories of Deportivo to a compressed CSV file
# Try the gz format (gz works; zip may cause some problems)
deportivo = df[df.Winner.str.contains('Deportivo')]
deportivo.to_csv('output/deportivo_wins.csv.gz',index=False,compression='gzip')
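For the zip case, one approach that may help is to pass `compression` as a dict, so the CSV stored inside the archive gets an explicit name. A sketch with made-up placeholder rows written to a temporary directory; the `method` and `archive_name` keys are pandas options, and everything else is assumed for illustration:

```python
import os
import tempfile

import pandas as pd

# Placeholder rows, not real league data
winners = pd.DataFrame({"Season": ["S1", "S2"], "Winner": ["Team A", "Team B"]})

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "winners.csv.gz")
    zip_path = os.path.join(tmp, "winners.zip")

    # gzip is inferred from the .gz extension
    winners.to_csv(gz_path, index=False)

    # For zip, name the CSV file stored inside the archive explicitly
    winners.to_csv(
        zip_path,
        index=False,
        compression={"method": "zip", "archive_name": "winners.csv"},
    )

    # Reading infers the compression from the file extension
    from_gz = pd.read_csv(gz_path)
    from_zip = pd.read_csv(zip_path)
```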