# Pandas Input/Output

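pandas can read many file formats, including `.csv`, `.json`, `.xlsx`, `.parquet`, `.db`, `.hdf`, and more. The full list of readers and writers is in the I/O reference:

https://pandas.pydata.org/pandas-docs/stable/reference/io.html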

In [1]:
import pandas as pd

### Our Data

- `attacks.csv` from Kaggle
- `github_pulls.json` from GitHub API
- `2021_Accidentalidad.xlsx` from Datos Abiertos Ayuntamiento de Madrid
- `test.parquet` from Kaggle

__.CSV Files__

In [2]:
# Raw "Comma"-Separated Values file (a.k.a.: CSV file)

with open('./datasets/attacks.csv', encoding="iso8859_15") as f:
    lines = f.readlines()
lines

['Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex ,Age,Injury,Fatal (Y/N),Time,Species ,Investigator or Source,pdf,href formula,href,Case Number,Case Number,original order,,\n',
 '2018.06.25,25-Jun-2018,2018,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and paddle damaged",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,2018.06.25,2018.06.25,6303,,\n',
 '2018.06.18,18-Jun-2018,2018,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson\xa0McNeely ,F,11,Minor injury to left thigh,N,14h00  -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,2018.06.18,2018.06

In [3]:
# Import .CSV
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

df_csv = pd.read_csv('./datasets/attacks.csv', encoding="iso8859_15")
df_csv.head(10)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,
5,2018.06.03.b,03-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,"Flat Rock, Ballina",Kite surfing,Chris,M,...,,"Daily Telegraph, 6/4/2018",2018.06.03.b-FlatRock.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.03.b,2018.06.03.b,6298.0,,
6,2018.06.03.a,03-Jun-2018,2018.0,Unprovoked,BRAZIL,Pernambuco,"Piedade Beach, Recife",Swimming,Jose Ernesto da Silva,M,...,Tiger shark,"Diario de Pernambuco, 6/4/2018",2018.06.03.a-daSilva.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.03.a,2018.06.03.a,6297.0,,
7,2018.05.27,27-May-2018,2018.0,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,M,...,"Lemon shark, 3'","K. McMurray, TrackingSharks.com",2018.05.27-Ponce.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.27,2018.05.27,6296.0,,
8,2018.05.26.b,26-May-2018,2018.0,Unprovoked,USA,Florida,"Cocoa Beach, Brevard County",Walking,Cody High,M,...,"Bull shark, 6'","K.McMurray, TrackingSharks.com",2018.05.26.b-High.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.26.b,2018.05.26.b,6295.0,,
9,2018.05.26.a,26-May-2018,2018.0,Unprovoked,USA,Florida,"Daytona Beach, Volusia County",Standing,male,M,...,,"K. McMurray, Tracking Sharks.com",2018.05.26.a-DaytonaBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.26.a,2018.05.26.a,6294.0,,


In [4]:
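# Let's explore the dataset

print(df_csv.shape)
# info() prints its report itself and returns None, so wrapping it in print() also prints a trailing "None"
print(df_csv.info())
# Keep four columns and drop the rows that have no country
df_csv_short = df_csv[['Date', 'Type', 'Country', 'Injury']]
df_csv_short = df_csv_short.dropna(subset=['Country'])
display(df_csv_short)
print(df_csv_short.shape)
print(df_csv_short.info())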

(25723, 24)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8702 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 34

Unnamed: 0,Date,Type,Country,Injury
0,25-Jun-2018,Boating,USA,"No injury to occupant, outrigger canoe and pad..."
1,18-Jun-2018,Unprovoked,USA,Minor injury to left thigh
2,09-Jun-2018,Invalid,USA,Injury to left lower leg from surfboard skeg
3,08-Jun-2018,Unprovoked,AUSTRALIA,Minor injury to lower leg
4,04-Jun-2018,Provoked,MEXICO,Lacerations to leg & hand shark PROVOKED INCIDENT
...,...,...,...,...
6297,Before 1903,Unprovoked,AUSTRALIA,FATAL
6298,Before 1903,Unprovoked,AUSTRALIA,FATAL
6299,1900-1905,Unprovoked,USA,FATAL
6300,1883-1889,Unprovoked,PANAMA,FATAL


(6252, 4)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6252 entries, 0 to 6301
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     6252 non-null   object
 1   Type     6248 non-null   object
 2   Country  6252 non-null   object
 3   Injury   6226 non-null   object
dtypes: object(4)
memory usage: 244.2+ KB
None
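
The shape above (25723 rows, of which only about 6300 hold data, plus trailing `Unnamed` columns) suggests the file carries empty rows and empty columns. A minimal sketch of one way to drop them, using a small made-up CSV rather than `attacks.csv` itself; the names `raw`, `toy` and `cleaned` are illustrative only:

```python
import io

import pandas as pd

# Toy CSV (invented for illustration): two trailing empty columns and two
# rows that contain only separators, similar in spirit to attacks.csv.
raw = (
    "Date,Type,Country,,\n"
    "25-Jun-2018,Boating,USA,,\n"
    "18-Jun-2018,Unprovoked,USA,,\n"
    ",,,,\n"
    ",,,,\n"
)

toy = pd.read_csv(io.StringIO(raw))

# Drop rows where every value is missing, then columns where every value is missing.
cleaned = toy.dropna(how="all").dropna(axis="columns", how="all")

print(toy.shape)
print(cleaned.shape)
```

Whether this is appropriate for the real file depends on what the empty rows represent, so it is worth inspecting them before dropping anything.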


In [5]:
# Create a new .csv file
# Note: to_csv() also writes the index as the first, unnamed column unless index=False is passed

df_csv_short.to_csv('./datasets/shark_attacks_short.csv', sep=';')

In [6]:
df_csv_s = pd.read_csv('./datasets/shark_attacks_short.csv', sep=";")

df_csv_s.head()

Unnamed: 0.1,Unnamed: 0,Date,Type,Country,Injury
0,0,25-Jun-2018,Boating,USA,"No injury to occupant, outrigger canoe and pad..."
1,1,18-Jun-2018,Unprovoked,USA,Minor injury to left thigh
2,2,09-Jun-2018,Invalid,USA,Injury to left lower leg from surfboard skeg
3,3,08-Jun-2018,Unprovoked,AUSTRALIA,Minor injury to lower leg
4,4,04-Jun-2018,Provoked,MEXICO,Lacerations to leg & hand shark PROVOKED INCIDENT
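
The extra `Unnamed: 0` column in the output above is the index that `to_csv` wrote to the file. Two common ways to avoid it are sketched here on a small made-up frame (`toy` and the other variable names are illustrative, not part of the notebook): either skip the index when writing, or tell `read_csv` which column holds it.

```python
import io

import pandas as pd

# Small invented frame with a non-consecutive index, like df_csv_short after dropna().
toy = pd.DataFrame(
    {"Date": ["25-Jun-2018", "18-Jun-2018"], "Country": ["USA", "USA"]},
    index=[0, 5],
)

# Option 1: do not write the index at all.
text_no_index = toy.to_csv(sep=";", index=False)
no_index = pd.read_csv(io.StringIO(text_no_index), sep=";")

# Option 2: write the index, then restore it from the first column when reading.
text_with_index = toy.to_csv(sep=";")
with_index = pd.read_csv(io.StringIO(text_with_index), sep=";", index_col=0)

print(no_index.columns.tolist())
print(with_index.index.tolist())
```

Option 1 suits data where the index carries no meaning; option 2 keeps the original row labels.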


---

__.JSON Files__

In [7]:
# Raw JavaScript Object Notation file (a.k.a.: JSON file)

import json

with open('./datasets/github_pulls.json', encoding="utf8") as f:
    json_file = json.load(f)

json_file

[{'url': 'https://api.github.com/repos/ta-data-mad/dataptmad1120/pulls/545',
  'id': 659270528,
  'node_id': 'MDExOlB1bGxSZXF1ZXN0NjU5MjcwNTI4',
  'html_url': 'https://github.com/ta-data-mad/dataptmad1120/pull/545',
  'diff_url': 'https://github.com/ta-data-mad/dataptmad1120/pull/545.diff',
  'patch_url': 'https://github.com/ta-data-mad/dataptmad1120/pull/545.patch',
  'issue_url': 'https://api.github.com/repos/ta-data-mad/dataptmad1120/issues/545',
  'number': 545,
  'state': 'open',
  'locked': False,
  'title': '[intro-to-ml] Alejandra Matías',
  'user': {'login': 'alejandramatiasmartin2',
   'id': 73171652,
   'node_id': 'MDQ6VXNlcjczMTcxNjUy',
   'avatar_url': 'https://avatars.githubusercontent.com/u/73171652?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/alejandramatiasmartin2',
   'html_url': 'https://github.com/alejandramatiasmartin2',
   'followers_url': 'https://api.github.com/users/alejandramatiasmartin2/followers',
   'following_url': 'https://api.githu

In [8]:
# Import .JSON
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html

df_json = pd.read_json('./datasets/github_pulls.json')
df_json.head()

Unnamed: 0,url,id,node_id,html_url,diff_url,patch_url,issue_url,number,state,locked,...,review_comments_url,review_comment_url,comments_url,statuses_url,head,base,_links,author_association,auto_merge,active_lock_reason
0,https://api.github.com/repos/ta-data-mad/datap...,659270528,MDExOlB1bGxSZXF1ZXN0NjU5MjcwNTI4,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://api.github.com/repos/ta-data-mad/datap...,545,open,False,...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,{'label': 'alejandramatiasmartin2:intro-to-ml'...,"{'label': 'ta-data-mad:master', 'ref': 'master...",{'self': {'href': 'https://api.github.com/repo...,NONE,,
1,https://api.github.com/repos/ta-data-mad/datap...,622553769,MDExOlB1bGxSZXF1ZXN0NjIyNTUzNzY5,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://api.github.com/repos/ta-data-mad/datap...,544,open,False,...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,{'label': 'CastillaAlvaro:natural-language-pro...,"{'label': 'ta-data-mad:master', 'ref': 'master...",{'self': {'href': 'https://api.github.com/repo...,NONE,,
2,https://api.github.com/repos/ta-data-mad/datap...,620405067,MDExOlB1bGxSZXF1ZXN0NjIwNDA1MDY3,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://api.github.com/repos/ta-data-mad/datap...,543,open,False,...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,{'label': 'alejandramatiasmartin2:two-sample-h...,"{'label': 'ta-data-mad:master', 'ref': 'master...",{'self': {'href': 'https://api.github.com/repo...,NONE,,
3,https://api.github.com/repos/ta-data-mad/datap...,618259082,MDExOlB1bGxSZXF1ZXN0NjE4MjU5MDgy,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://api.github.com/repos/ta-data-mad/datap...,542,open,False,...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,{'label': 'alejandramatiasmartin2:hypothesis-t...,"{'label': 'ta-data-mad:master', 'ref': 'master...",{'self': {'href': 'https://api.github.com/repo...,NONE,,
4,https://api.github.com/repos/ta-data-mad/datap...,616778947,MDExOlB1bGxSZXF1ZXN0NjE2Nzc4OTQ3,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://github.com/ta-data-mad/dataptmad1120/p...,https://api.github.com/repos/ta-data-mad/datap...,541,open,False,...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,https://api.github.com/repos/ta-data-mad/datap...,{'label': 'manuaq:natural-language-processing'...,"{'label': 'ta-data-mad:master', 'ref': 'master...",{'self': {'href': 'https://api.github.com/repo...,NONE,,


In [9]:
# Let's explore the dataset

print(df_json.shape)
print(df_json.info())

(30, 36)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 36 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   url                  30 non-null     object             
 1   id                   30 non-null     int64              
 2   node_id              30 non-null     object             
 3   html_url             30 non-null     object             
 4   diff_url             30 non-null     object             
 5   patch_url            30 non-null     object             
 6   issue_url            30 non-null     object             
 7   number               30 non-null     int64              
 8   state                30 non-null     object             
 9   locked               30 non-null     bool               
 10  title                30 non-null     object             
 11  user                 30 non-null     object             
 12  body           

In [10]:
# Flatten one level of the '_links' column: each key of the dicts becomes a column
# (the values are still nested {'href': ...} dicts)

df_json_new = pd.DataFrame(list(df_json['_links']))
df_json_new.head()

Unnamed: 0,self,html,issue,comments,review_comments,review_comment,commits,statuses
0,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://github.com/ta-data-mad/datap...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...
1,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://github.com/ta-data-mad/datap...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...
2,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://github.com/ta-data-mad/datap...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...
3,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://github.com/ta-data-mad/datap...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...
4,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://github.com/ta-data-mad/datap...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...,{'href': 'https://api.github.com/repos/ta-data...
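
Each cell above still holds a nested `{'href': ...}` dict. If fully flat columns are wanted, `pd.json_normalize` is one option. A minimal sketch on two made-up records shaped like the `_links` entries (the `example.com` URLs are placeholders, not data from the file):

```python
import pandas as pd

# Two invented records that mimic the structure of the '_links' dicts.
links = [
    {"self": {"href": "https://example.com/pulls/1"},
     "html": {"href": "https://example.com/pull/1"}},
    {"self": {"href": "https://example.com/pulls/2"},
     "html": {"href": "https://example.com/pull/2"}},
]

# json_normalize expands nested dicts into columns named "<outer>.<inner>".
flat = pd.json_normalize(links)

print(flat.columns.tolist())
```

With the real data, the same call could presumably be applied to `list(df_json['_links'])`.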


In [11]:
# Create a new .JSON file

df_json_new.to_json('./datasets/github_pulls_new.json')
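
By default `to_json` writes a DataFrame column by column (`orient='columns'`). When the file should look like the original list of records, `orient='records'` may be a better fit. A small sketch with an invented two-row frame (`toy` is illustrative only):

```python
import io
import json

import pandas as pd

toy = pd.DataFrame({"number": [545, 544], "state": ["open", "open"]})

# Default layout: one JSON object per column, keyed by the index labels.
by_columns = toy.to_json()

# Record layout: a JSON list with one object per row.
by_records = toy.to_json(orient="records")

# Reading the record layout back gives the same rows.
restored = pd.read_json(io.StringIO(by_records), orient="records")

print(by_columns)
print(by_records)
```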

---

__.XLSX Files__

In [12]:
# Additional libraries for Excel files
# (xlrd reads legacy .xls files, openpyxl reads .xlsx files)

#!conda install -y xlrd
#!conda install -y openpyxl

In [13]:
# Import .XLSX 
# https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

df_excel = pd.read_excel('./datasets/2021_Accidentalidad.xlsx', sheet_name='Accidentes_2021')
df_excel.head()

Unnamed: 0,num_expediente,fecha,hora,localizacion,numero,distrito,tipo_accidente,estado_meteorológico,tipo_vehiculo,tipo_persona,rango_edad,sexo,lesividad,coordenada_x_utm,coordenada_y_utm,positiva_alcohol,positiva_droga
0,2020S019534,2021-01-01,04:30:00,AVDA. PABLO NERUDA / CALL. LEONESES,57,PUENTE DE VALLECAS,Colisión fronto-lateral,Despejado,Turismo,Conductor,Desconocido,Desconocido,,444926.3,4470383.11,N,
1,2020S019534,2021-01-01,04:30:00,AVDA. PABLO NERUDA / CALL. LEONESES,57,PUENTE DE VALLECAS,Colisión fronto-lateral,Despejado,Turismo,Conductor,De 30 a 34 años,Mujer,14.0,444926.3,4470383.11,N,
2,2020S019534,2021-01-01,04:30:00,AVDA. PABLO NERUDA / CALL. LEONESES,57,PUENTE DE VALLECAS,Colisión fronto-lateral,Despejado,Turismo,Conductor,De 35 a 39 años,Hombre,7.0,444926.3,4470383.11,N,
3,2020S019534,2021-01-01,04:30:00,AVDA. PABLO NERUDA / CALL. LEONESES,57,PUENTE DE VALLECAS,Colisión fronto-lateral,Despejado,Turismo,Pasajero,De 10 a 14 años,Hombre,14.0,444926.3,4470383.11,N,
4,2020S019534,2021-01-01,04:30:00,AVDA. PABLO NERUDA / CALL. LEONESES,57,PUENTE DE VALLECAS,Colisión fronto-lateral,Despejado,Turismo,Pasajero,De 35 a 39 años,Mujer,14.0,444926.3,4470383.11,N,


In [14]:
# Let's explore the dataset

print(df_excel.shape)
print(df_excel.info())

(24464, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24464 entries, 0 to 24463
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   num_expediente        24464 non-null  object        
 1   fecha                 24464 non-null  datetime64[ns]
 2   hora                  24464 non-null  object        
 3   localizacion          24464 non-null  object        
 4   numero                24462 non-null  object        
 5   distrito              24461 non-null  object        
 6   tipo_accidente        24463 non-null  object        
 7   estado_meteorológico  21863 non-null  object        
 8   tipo_vehiculo         24371 non-null  object        
 9   tipo_persona          24454 non-null  object        
 10  rango_edad            24464 non-null  object        
 11  sexo                  24464 non-null  object        
 12  lesividad             13293 non-null  object        
 13  coor

---

__.PARQUET Files__

In [15]:
# Additional library for Parquet files
# pd.read_parquet needs pyarrow or fastparquet; uncomment the line below to install pyarrow

#!conda install -c conda-forge pyarrow

In [16]:
# Import .PARQUET (column-oriented data storage format with schema)
# https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

df_parquet = pd.read_parquet('./datasets/test.parquet')
df_parquet.head()

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

In [17]:
# Let's explore the dataset (https://www.kaggle.com/dschettler8845/recsys-2020-ecommerce-dataset)

print(df_parquet.shape)
print(df_parquet.info())

NameError: name 'df_parquet' is not defined
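
Both errors above come from the missing Parquet engine: `read_parquet` failed, so `df_parquet` was never created. One way to check for an engine before reading is sketched below; `available_parquet_engines` is a hypothetical helper written for this example, not a pandas function.

```python
import importlib.util


def available_parquet_engines():
    """Return the names of the Parquet engines that can be imported here."""
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()

if engines:
    print("Parquet engines found:", engines)
else:
    print("No Parquet engine found; install pyarrow or fastparquet first.")
```

Once at least one engine is reported, re-running the `pd.read_parquet` cell should no longer raise the `ImportError`.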

---

__SQL Files...in the next episode...__