# Pandas - Reading data

This notebook is the second part of the collection devoted to the pandas library.

It explores how data can be imported into DataFrames.

More details can be found in the official documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql

Most of the functions for reading data are named `pandas.read_XXX`, where XXX is the format used. We will go through the most commonly used ones.

In [None]:
# Necessary import evil

import jupy_helpers
import pandas as pd
from IPython.display import display

In [None]:
# List the input functions available in pandas.

print("\n".join(method for method in dir(pd) if method.startswith("read_")))

## Read CSV

Nowadays, a lot of data comes in the textual comma-separated values (CSV) format.
Although not properly standardized, it is the de facto standard for files that are not
huge and are meant to be read by human eyes too.

Let's read the ratings of several (hundred) movies from Rotten Tomatoes:

In [None]:
%head ../data/rotten_tomatoes_top_movies_2019-01-15.csv 10

In [None]:
rotten_df = pd.read_csv("../data/rotten_tomatoes_top_movies_2019-01-15.csv")
rotten_df.head(9)

Pandas infers an appropriate data type for each column automatically:

In [None]:
rotten_df.dtypes
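The type inference can be tried on a tiny in-memory file. The sketch below uses made-up sample data (the column names and values are assumptions, not part of the workshop files) and shows how the `dtype` and `parse_dates` arguments of `read_csv` can override the defaults when the inferred types are not what we want.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the workshop files.
csv_text = """title,year,score,released
Alpha,2001,7.5,2001-03-14
Beta,2010,8.0,2010-11-02
"""

# Default behaviour: numbers are detected, dates stay as text.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Override the inference where it is not what we want.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"year": "int32"},
    parse_dates=["released"],
)
print(explicit.dtypes)
```

Comparing the two printouts shows which columns changed type.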

Sometimes reading a CSV file does not work out of the box. Although pandas automatically recognizes and reads compressed files,
it usually does not infer details of the file format, such as the delimiter. For details, see the `read_csv` documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [None]:
pd.read_csv('../data/title.basics.tsv.gz')

In this case, the file uses tabs rather than commas to separate values. Therefore, we need to specify the separator explicitly:

In [None]:
imdb_titles = pd.read_csv('../data/title.basics.tsv.gz', sep='\t')
imdb_titles.head()

Did you notice the `\N` values in the `endYear` column?

**Exercise:** Use the `na_values` argument to mark `\N` as a null (missing) value.

In [None]:
%exercise

# imdb_titles = pd.read_csv('../data/title.basics.tsv.gz', sep='\t', na_values=...)
imdb_titles = pd.read_csv('../data/title.basics.tsv.gz', sep='\t', na_values="\\N")

In [None]:
%validate

assert pd.isna(imdb_titles.loc[0, 'endYear'])

See the difference?

In [None]:
imdb_titles.head()
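The effect of `sep` and `na_values` can also be reproduced without the large IMDb file. The following is a minimal sketch on a made-up tab-separated snippet (the identifiers and years are hypothetical) that reads the same text with and without `na_values`.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample that mimics the `\N` convention.
tsv_text = "tconst\tstartYear\tendYear\ntt01\t1999\t\\N\ntt02\t2005\t2008\n"

# Without na_values, `\N` is kept as an ordinary string.
raw = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With na_values, `\N` becomes a missing value and the column turns numeric.
clean = pd.read_csv(io.StringIO(tsv_text), sep="\t", na_values="\\N")

print(raw["endYear"].tolist())
print(clean["endYear"].tolist())
print(clean.dtypes)
```

Note that the backslash has to be doubled in a regular Python string literal, which is why the argument is written as `"\\N"`.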

## Read Excel

Let's read the list of the best movies by genre from The Guardian (a bit old, written in 2010).

![Screenshot](guardian-best-horrors.png)

In [None]:
pd.read_excel("../data/guardian-greatest_films_of_all_time.xlsx")

Hmmmmph... Pandas parsed just the first sheet. Let's see what the options are. If in doubt, look in the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/io.html#excel

In [None]:
xlsx = pd.ExcelFile("../data/guardian-greatest_films_of_all_time.xlsx")
xlsx

In [None]:
xlsx.sheet_names

In [None]:
xlsx.parse("HORROR")

In [None]:
%exercise

#crimes =...                    # Find the table of crime movies
#tenth_best = crimes.loc[...]   # Find the 10-th best crime movie
#movie_name = ...               # Get the name of the movie

#movie_name

crimes = pd.read_excel("../data/guardian-greatest_films_of_all_time.xlsx", sheet_name="CRIME")
tenth_best = crimes.loc[9]
movie_name = tenth_best["Film"]

In [None]:
%validate

assert movie_name[7:9] == "la"

## Read JSON

In [None]:
wiki_movies = pd.read_json("../data/wikipedia-movies.json")
wiki_movies.head(10)

## Read SQL

On its own, pandas can read SQLite databases. If the **sqlalchemy** package is installed, pandas can access
any database supported by that library.

In [None]:
# This requires sqlalchemy
award_table = pd.read_sql("awards", con='sqlite:///../data/awards.sqlite')
award_table.tail(20)

In [None]:
# It is possible to pass a SQL query too (no sqlalchemy necessary with sqlite3)
import sqlite3
connection = sqlite3.connect("../data/awards.sqlite")

awards2017 = pd.read_sql("SELECT * FROM awards WHERE Year=2017", con=connection)
awards2017
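When the filter value comes from a variable, it is usually safer to pass it through the `params` argument of `read_sql` than to paste it into the query string. The sketch below assumes a made-up in-memory table (its name, columns, and rows are illustrative only) and uses the `?` placeholder style of the `sqlite3` driver.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table; the contents are illustrative only.
demo_connection = sqlite3.connect(":memory:")
demo_connection.execute("CREATE TABLE awards (Year INTEGER, Film TEXT)")
demo_connection.executemany(
    "INSERT INTO awards VALUES (?, ?)",
    [(2016, "Film A"), (2017, "Film B"), (2017, "Film C")],
)
demo_connection.commit()

# The value is bound by the driver instead of being formatted into the SQL text.
year = 2017
selected = pd.read_sql(
    "SELECT * FROM awards WHERE Year = ? ORDER BY Film",
    con=demo_connection,
    params=(year,),
)
print(selected)
demo_connection.close()
```

Other database drivers may use a different placeholder syntax, so check the documentation of the driver in use.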

## Read HTML

Pandas is able to scrape data from tables embedded in web pages using the `read_html` function.
The results may or may not be good, and you will probably have to tweak the resulting
data frame manually. But it is a good starting point, and much better than being forced to parse
the HTML yourself!

Let's download a list of highest-grossing films from Wikipedia!

In [None]:
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films")
type(tables), len(tables)

Does the page really contain 95 tables? The number is quite high, so we must check which of the tables
are meaningful and which are not. We are mostly interested in the first one displayed on the page.

**Exercise:** Find **i** to obtain the right table:

In [None]:
%exercise

# i = ...

i = 0
table = tables[i]
table.head(10)

In [None]:
%validate

assert table.iloc[2]["Title"] == "Titanic"  # 3rd highest-grossing movie ever
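Instead of trying indices one by one, the list of tables can be filtered by the columns we expect to find. This is only a sketch: the frames below are made-up stand-ins for the list returned by `read_html`, and the expected column names are an assumption about the table we are looking for.

```python
import pandas as pd

# Stand-ins for the list returned by `pd.read_html`; the contents are made up.
candidate_tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({"Rank": [1, 2], "Title": ["Film A", "Film B"]}),
    pd.DataFrame({"Year": [2001], "Title": ["Film C"]}),
]

# Keep the positions of the tables that contain all the expected columns.
wanted = {"Rank", "Title"}
matches = [
    index
    for index, candidate in enumerate(candidate_tables)
    if wanted.issubset(candidate.columns)
]
print(matches)
```

`read_html` also accepts a `match` argument that keeps only the tables containing a given text, which can reduce the number of candidates before any filtering.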

## Write CSV

Pandas can write to many formats, and the usage is similar for all of them.

In [None]:
award_table.to_csv("awards.csv", index=False)

In [None]:
%head awards.csv 10
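To see what `index=False` changes, we can write a small frame to a string instead of a file: `to_csv` returns the CSV text when no path is given. The frame below is hypothetical and serves only as an illustration.

```python
import io

import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({"film": ["Film A", "Film B"], "wins": [3, 1]})

# With the default index=True, the row labels become an unnamed first column.
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)
print(with_index)
print(without_index)

# Reading the text back gives a frame with the original columns.
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

Without `index=False`, reading the file back would produce an extra `Unnamed: 0` column holding the old row labels.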

## Write SQL

Load all the data for the rest of the workshop and save it into a local SQLite database.

**Note**: This is an important step. We will use the data in the later phases.
If in doubt, refer to the "solution" version of this file (TODO: link).

In [None]:
workshop_data = dict(
    imdb_titles = imdb_titles,
    imdb_ratings = pd.read_csv('../data/title.ratings.tsv.gz', sep='\t'),
    boxoffice = pd.read_csv('../data/boxoffice_march_2019.csv.gz'),
    rotten_tomatoes = rotten_df,
    awards = award_table
)

In [None]:
con = 'sqlite:///./workshop_data.sqlite'

for name, df in workshop_data.items():
    df.to_sql(name, con, if_exists="replace", index=False)
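The same write loop can be checked on a throwaway database. The sketch below uses a plain `sqlite3` in-memory connection instead of an sqlalchemy URL, with made-up frames and table names, and then lists the stored tables to confirm that the write succeeded.

```python
import sqlite3

import pandas as pd

# Hypothetical frames and table names, unrelated to the workshop data.
demo_data = {
    "films": pd.DataFrame({"title": ["Film A", "Film B"], "year": [2001, 2010]}),
    "ratings": pd.DataFrame({"title": ["Film A"], "score": [7.5]}),
}

demo_connection = sqlite3.connect(":memory:")
for table_name, frame in demo_data.items():
    # if_exists="replace" drops an existing table of the same name first.
    frame.to_sql(table_name, demo_connection, if_exists="replace", index=False)

# List the tables that now exist and read one of them back.
stored_tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    con=demo_connection,
)["name"].tolist()
films_back = pd.read_sql("SELECT * FROM films", con=demo_connection)
demo_connection.close()
print(stored_tables, len(films_back))
```

Because of `if_exists="replace"`, running the loop again overwrites the tables instead of raising an error.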

Note: When you are done with this notebook, we suggest that you shut down the kernel to free the memory.