### import statements

Import statements usually sit at the top of a module and load the other modules the code depends on.

* import os: import the os module
* import pandas as pd: import the pandas module under the alias "pd", the conventional shorthand, which keeps code short and readable.

In [1]:
import os
import pandas as pd

### read_csv (or read_excel)

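read_csv and read_excel read a file into a pandas DataFrame.

* A DataFrame is like a table whose columns can each hold a different type.
* Each column of a pandas DataFrame is effectively a pandas Series.
* The DataFrame info() method reports the name and type of each column, and normally the non-null count as well. When a column contains NULLs, its count is less than the number of rows in the DataFrame. By default pandas omits the counts for very large DataFrames, as in the output below.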
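
The points above can be illustrated with a small sketch. The toy data here is hypothetical (only the column names are borrowed from the IMDb file), and it is small enough that info() prints the non-null counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the IMDb file, the values are made up.
toy_df = pd.DataFrame({
    "primaryTitle": ["A", "B", "C"],
    "isAdult": [0, 0, 1],
    "runtimeMinutes": [90.0, np.nan, 45.0],  # one missing value
})

# Selecting a single column returns a Series.
title_col = toy_df["primaryTitle"]
print(type(title_col).__name__)

# count() returns the non-null count per column, the same figure info() reports.
non_null = toy_df.count()
print(non_null["runtimeMinutes"], "non-null values out of", len(toy_df), "rows")

# info() lists each column with its non-null count and dtype.
toy_df.info()
```

The runtimeMinutes column reports one fewer non-null value than the number of rows, which is how a column with NULLs shows up in info().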

In [2]:
filename = "data.tsv"
# The file is tab-separated, so pass sep='\t'
imdb_df = pd.read_csv(filename, sep='\t')
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5723534 entries, 0 to 5723533
Data columns (total 9 columns):
tconst            object
titleType         object
primaryTitle      object
originalTitle     object
isAdult           int64
startYear         object
endYear           object
runtimeMinutes    object
genres            object
dtypes: int64(1), object(8)
memory usage: 393.0+ MB


In [9]:
# The file marks missing values with the literal string \N; drop rows with no start year
reduced_df = imdb_df[imdb_df.startYear != '\\N'].copy()


In [11]:
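# startYear was read as text, so add an integer copy of it
reduced_df['startYearInt'] = reduced_df.startYear.astype('int')
# Keep only titles that started after 1990
reduced_df = reduced_df[reduced_df.startYearInt > 1990]
reduced_df.info()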

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312405 entries, 15495 to 5723533
Data columns (total 10 columns):
tconst            object
titleType         object
primaryTitle      object
originalTitle     object
isAdult           int64
startYear         object
endYear           object
runtimeMinutes    object
genres            object
startYearInt      int64
dtypes: int64(2), object(8)
memory usage: 361.9+ MB
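
As an alternative to the two cleaning steps above, the conversion and the filter can be sketched in one pass with pd.to_numeric. This is not the approach used in this notebook, only a possible variant; the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data using the same \N marker for missing years.
toy_df = pd.DataFrame({"startYear": ["1989", "\\N", "1995", "2004"]})

# errors="coerce" turns anything that is not a number (here \N) into NaN.
toy_df["startYearInt"] = pd.to_numeric(toy_df["startYear"], errors="coerce")

# Comparisons against NaN are False, so rows with a missing year drop out here too.
recent_df = toy_df[toy_df["startYearInt"] > 1990].copy()

# With the NaN rows gone, the column can be stored as integers.
recent_df["startYearInt"] = recent_df["startYearInt"].astype("int64")
print(recent_df)
```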


In [12]:
# Keep only the columns needed later; this also drops startYearInt
reduced_df = reduced_df[['primaryTitle','isAdult','startYear','genres']]
reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312405 entries, 15495 to 5723533
Data columns (total 4 columns):
primaryTitle    object
isAdult         int64
startYear       object
genres          object
dtypes: int64(1), object(3)
memory usage: 164.5+ MB


In [13]:
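# Save the reduced table; the row index is written as an unnamed first column unless index=False is passed
reduced_df.to_csv('imdb_titles_reduced.csv')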

### describe()

describe() can be called on a single column of a DataFrame (a Series) to get summary statistics such as the count, mean, min, max and median (reported as 50%).
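
A minimal sketch of describe() on a numeric Series follows. The values are made up for illustration, and the Series name is borrowed from the column used earlier.

```python
import pandas as pd

# Hypothetical start years, for illustration only.
years = pd.Series([1995, 2001, 2001, 2010, 2018], name="startYearInt")

# describe() returns a Series indexed by statistic name:
# count, mean, std, min, 25%, 50%, 75%, max.
summary = years.describe()
print(summary)

# Individual statistics can be looked up by label; "50%" is the median.
print(summary["min"], summary["50%"], summary["max"])
```

On a text (object) column, describe() reports count, unique, top and freq instead.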

In [None]:
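# Keep titles that started in 2000 or later
dramacom_df = dramacom_df[dramacom_df.StartYearInt >= 2000]

In [None]:
# Inspect rows with implausibly late start years
dramacom_df[dramacom_df.StartYearInt >= 2115]

In [None]:
# Drop years from 2020 onward, then count the titles per year
dramacom_df = dramacom_df[dramacom_df.StartYearInt < 2020]
dramacom_df.StartYearInt.value_counts()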

In [None]:
pd.crosstab(dramacom_df.genres, dramacom_df.StartYearInt)

In [None]:
pd.crosstab(dramacom_df.genres, dramacom_df.IsAdult)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
pdata = pd.crosstab(dramacom_df.genres, dramacom_df.StartYearInt)
type(pdata)

In [None]:
pdata.info()

In [None]:
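# Swap the arguments so that years form the index and each genre becomes a column
pdata = pd.crosstab(dramacom_df.StartYearInt, dramacom_df.genres)
pdata.info()

In [None]:
# Draw one line per column (genre), with the index (year) on the x-axis
pdata.plot.line()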
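
The effect of the argument order in pd.crosstab can be seen in a small sketch. The toy data is hypothetical and stands in for dramacom_df.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy_df = pd.DataFrame({
    "genres": ["Drama", "Comedy", "Drama", "Comedy", "Drama"],
    "StartYearInt": [2000, 2000, 2001, 2001, 2001],
})

# The first argument becomes the row index, the second becomes the columns.
by_year = pd.crosstab(toy_df.StartYearInt, toy_df.genres)
print(by_year)

# Each cell counts the rows with that (year, genre) combination.
print(by_year.loc[2001, "Drama"])
```

With years as the index, plot.line() draws one line per genre across the years.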