<div class="alert alert-block alert-info">

# 06_Lecture - Analysis of a DataFrame
</div>

In [1]:
import os
import numpy as np
import pandas as pd
print(os.getcwd())

/Users/dev/Downloads


<hr style="height:2px; border-width:0; color:pink; background-color:pink">

## Load data and Analyze
<br>
<li><font color=darkblue> We will load the file <font color=red>movies.csv</font></font> </li>
<li><font color=darkblue> There are three columns in this dataset - movieId, title and genres</font> </li>
<li><font color=darkblue> movieId is meant to be the primary key, so each movieId should correspond to exactly one title </font> </li>

In [4]:
movies_df = pd.read_csv(os.path.join(os.getcwd(), "movies.csv"))

In [5]:
movies_df.head(5)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


<font color=darkblue> The attribute <font color=red>.dtypes</font> provides data types for each column in a DataFrame. </font>

In [6]:
movies_df.dtypes

movieId     int64
title      object
genres     object
dtype: object

<font color=darkblue> The method <font color=red>.info()</font> reports the index range, the non-null count and dtype of each column, and the memory usage. </font>

In [7]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


<font color=darkblue> The method <font color=red>.select_dtypes()</font> can be used to include or exclude columns by their dtypes. </font>

In [8]:
# Exclude by dtype
movies_df.select_dtypes(exclude = ["object", "int64"]) # NOTE: int64 and int are different

0
1
2
3
4
...
9737
9738
9739
9740
9741


In [9]:
# Include a certain dtype
movies_df.select_dtypes(include = "object")

,title,genres
0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,Jumanji (1995),Adventure|Children|Fantasy
2,Grumpier Old Men (1995),Comedy|Romance
3,Waiting to Exhale (1995),Comedy|Drama|Romance
4,Father of the Bride Part II (1995),Comedy
...,...,...
9737,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,Flint (2017),Drama
9740,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
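
<font color=darkblue> As a minimal sketch of how include and exclude split a DataFrame, the toy frame below (made-up data, not movies.csv; the avg_rating column is hypothetical) uses the generic "number" selector, which covers every integer and float width. </font>

```python
import pandas as pd

# Hypothetical toy data standing in for movies.csv
toy_df = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "avg_rating": [3.9, 3.4, 3.9],  # made-up values for illustration
})

numeric_cols = toy_df.select_dtypes(include="number")  # keep numeric columns only
other_cols = toy_df.select_dtypes(exclude="number")    # keep everything else

print(list(numeric_cols.columns))  # ['movieId', 'avg_rating']
print(list(other_cols.columns))    # ['title']
```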


<font color=darkblue> The method <font color=red>.describe()</font> provides summary statistics (count, mean, std, min, quartiles, max) for the numeric columns. </font>

In [10]:
movies_df.describe() # the parameter percentiles can be used to generate additional percentiles.
# movies_df.describe(percentiles=[.10,.25,.60])

,movieId
count,9742.0
mean,42200.353623
std,52160.494854
min,1.0
25%,3248.25
50%,7300.0
75%,76232.0
max,193609.0


In [11]:
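# Another way to generate quantiles. Pass numeric_only=True: without it, pandas also tries
# to compute quantiles of the string columns (title, genres) and raises
# TypeError: unsupported operand type(s) for -: 'str' and 'str'
movies_df.quantile([0.1, 0.25, 0.3, 0.4, 0.5, 0.6, 0.75, 0.8, 1.0], numeric_only=True)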
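
<font color=darkblue> A minimal sketch, on made-up data, of two ways to keep string columns out of a quantile computation: select the numeric column first, or ask the DataFrame method to skip non-numeric columns. The frame and its placeholder titles are assumptions for illustration only. </font>

```python
import pandas as pd

# Hypothetical toy frame: one numeric column, one string column
toy_df = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": ["A", "B", "C", "D", "E"],  # placeholder titles
})

# Option 1: pick the numeric column explicitly
col_q = toy_df["movieId"].quantile([0.25, 0.5, 0.75])

# Option 2: let pandas ignore the non-numeric columns
frame_q = toy_df.quantile([0.25, 0.5, 0.75], numeric_only=True)

print(col_q.tolist())               # [2.0, 3.0, 4.0]
print(frame_q["movieId"].tolist())  # [2.0, 3.0, 4.0]
```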

In [None]:
# A quick way to identify number of rows
len(movies_df)

In [12]:
# To get the number of rows and columns of the DataFrame
movies_df.shape

(9742, 3)

<font color=darkblue> The method <font color=red>value_counts()</font> provides the number of times a value in a column occurs.</font>

In [13]:
movies_df['title'].value_counts() # .value_counts(normalize=True) returns the proportion of each value within that column
# Hmm, we are seeing the same title repeat more than once.
# We thought each title is unique to a movieId and that each movieId should only be present once

title
Emma (1996)                               2
War of the Worlds (2005)                  2
Confessions of a Dangerous Mind (2002)    2
Eros (2004)                               2
Saturn 3 (1980)                           2
                                         ..
Lost and Delirious (2001)                 1
Rape Me (Baise-moi) (2000)                1
Alice (1990)                              1
Another Woman (1988)                      1
Andrew Dice Clay: Dice Rules (1991)       1
Name: count, Length: 9737, dtype: int64

In [14]:
type(movies_df['title'].value_counts())

pandas.core.series.Series
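
<font color=darkblue> Because the result is a Series indexed by the counted values, it can be queried by label. The sketch below uses a made-up genre column (an assumption, not the real data) to show the raw counts next to the normalize=True proportions. </font>

```python
import pandas as pd

# Hypothetical values for illustration
genres = pd.Series(["Comedy", "Drama", "Comedy", "Comedy"])

counts = genres.value_counts()                # occurrences of each value
shares = genres.value_counts(normalize=True)  # same, as fractions of the total

print(counts["Comedy"])  # 3
print(shares["Comedy"])  # 0.75
```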

In [15]:
# let's first identify which titles are repeating
movies_df['title'].value_counts().loc[lambda x : x>1] # Remember I mentioned Method Chaining earlier? This is one such example.

title
Emma (1996)                               2
War of the Worlds (2005)                  2
Confessions of a Dangerous Mind (2002)    2
Eros (2004)                               2
Saturn 3 (1980)                           2
Name: count, dtype: int64

In [16]:
# Another example of method chaining
movies_df['title'].value_counts().head(6)

title
Emma (1996)                               2
War of the Worlds (2005)                  2
Confessions of a Dangerous Mind (2002)    2
Eros (2004)                               2
Saturn 3 (1980)                           2
Paranoid Park (2007)                      1
Name: count, dtype: int64
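
<font color=darkblue> The chain above returns counts only. As a sketch of one possible follow-up, the toy frame below (made-up movieIds, not the real file) collects the repeated titles from the same chain and then uses .duplicated(keep=False) to pull the full rows for every copy. </font>

```python
import pandas as pd

# Hypothetical toy data; the movieId values are invented
toy_df = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title": ["Emma (1996)", "Eros (2004)", "Emma (1996)", "Flint (2017)"],
})

# Chained: count titles, keep counts above 1, collect the labels
repeated = toy_df["title"].value_counts().loc[lambda s: s > 1].index.tolist()

# Full rows for every repeated title (keep=False marks all copies, not just later ones)
dupe_rows = toy_df[toy_df["title"].duplicated(keep=False)]

print(repeated)                       # ['Emma (1996)']
print(dupe_rows["movieId"].tolist())  # [10, 30]
```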

<font color=darkblue> The method <font color=red>isna()</font> is used to identify if there are missing values.</font>

In [17]:
movies_df.isna() # returns True wherever a value is missing. Too many rows to sift through!

,movieId,title,genres
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
...,...,...,...
9737,False,False,False
9738,False,False,False
9739,False,False,False
9740,False,False,False


In [18]:
# A convenient way to quantify any missing values
movies_df.isna().sum()

movieId    0
title      0
genres     0
dtype: int64
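
<font color=darkblue> movies.csv has no missing values, so the counts above are all zero. The sketch below uses a made-up frame with gaps (an assumption for illustration) to show what the same pattern reports, plus the share of missing values per column. </font>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately missing entries
toy_df = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "rating": [4.0, 3.5, np.nan, np.nan],
})

missing_counts = toy_df.isna().sum()         # missing values per column
missing_share = toy_df.isna().mean()         # fraction missing per column
has_any_missing = toy_df.isna().any().any()  # True if anything is missing at all

print(missing_counts.to_dict())  # {'title': 1, 'rating': 2}
print(missing_share["rating"])   # 0.5
```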

<font color=darkblue> Get a slice of a dataset based on a condition.</font>

In [19]:
# Inspect all the rows with the title 'Saturn 3 (1980)'
movies_df[movies_df['title']=='Saturn 3 (1980)']

,movieId,title,genres
2141,2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller
9468,168358,Saturn 3 (1980),Sci-Fi|Thriller
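
<font color=darkblue> The expression inside the brackets is itself a boolean Series with one True/False per row. A minimal sketch on a three-row toy frame (the middle row is made up for contrast): </font>

```python
import pandas as pd

# Toy frame; only the two Saturn 3 movieIds come from the output above
toy_df = pd.DataFrame({
    "movieId": [2851, 7, 168358],
    "title": ["Saturn 3 (1980)", "Sabrina (1995)", "Saturn 3 (1980)"],
})

mask = toy_df["title"] == "Saturn 3 (1980)"  # boolean Series, one entry per row
matches = toy_df[mask]                       # rows where the mask is True

print(mask.tolist())                # [True, False, True]
print(matches["movieId"].tolist())  # [2851, 168358]
```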


<li><font color=darkblue> We will load the file <font color=red>ratings.csv</font></font> </li>
<li><font color=darkblue> There are four columns in this dataset - userId | movieId | rating | timestamp</font> </li>
<li><font color=darkblue> The timestamp column holds the number of seconds elapsed since Jan 1, 1970 UTC (the start of the Unix epoch). We will convert it to a regular datetime. </font> </li>

In [20]:
ratings_df = pd.read_csv(os.path.join(os.getcwd(), 'ratings.csv'))
ratings_df

FileNotFoundError: [Errno 2] No such file or directory: '/Users/dev/Downloads/ratings.csv'

In [21]:
# Convert epoch to datetime
ratings_df['timestamp'] = pd.to_datetime(ratings_df['timestamp'].astype(int), unit='s')
ratings_df
# astype() casts the column to the dtype it is passed.
# Here, it makes sure ratings_df['timestamp'] is all integers before the conversion.
# pd.to_datetime() has a parameter unit that states what the integers measure relative to the epoch. 's' represents seconds.

NameError: name 'ratings_df' is not defined
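
<font color=darkblue> Since ratings.csv was not found in this run, here is a minimal sketch of the same conversion on made-up epoch values (an assumption for illustration, not data from the file). </font>

```python
import pandas as pd

# Hypothetical epoch seconds: the epoch itself, one day later, and one billion seconds
toy_ratings = pd.DataFrame({"timestamp": [0, 86_400, 1_000_000_000]})

toy_ratings["timestamp"] = pd.to_datetime(toy_ratings["timestamp"], unit="s")

print(toy_ratings["timestamp"].iloc[1])           # 1970-01-02 00:00:00
print(toy_ratings["timestamp"].dt.year.tolist())  # [1970, 1970, 2001]
```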

<font color=darkblue> How many unique users and movies are in this file?</font>

In [22]:
ratings_df[['userId']].nunique() # returns a Series; ratings_df['userId'].nunique() returns a plain integer


In [23]:
ratings_df['movieId'].nunique()


<font color=darkblue> What are the unique userIds?</font>

In [24]:
ratings_df['userId'].unique() # returns a NumPy array of the unique userId values


In [25]:
# Often we want to save the results of these operations. Saving here to a list.
unique_users = ratings_df['userId'].unique().tolist() # without .tolist(), it would stay a NumPy array
len(unique_users)

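
<font color=darkblue> A small sketch of how .nunique() and .unique() differ in what they return, using a made-up ratings frame (the ids are assumptions, not values from ratings.csv). </font>

```python
import pandas as pd

# Hypothetical toy ratings
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 10, 20, 30],
})

n_users = toy_ratings["userId"].nunique()                  # single integer
per_column = toy_ratings[["userId", "movieId"]].nunique()  # Series: one count per column
user_list = toy_ratings["userId"].unique().tolist()        # NumPy array turned into a list

print(n_users)               # 3
print(per_column.to_dict())  # {'userId': 3, 'movieId': 3}
print(user_list)             # [1, 2, 3]
```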

<font color=darkblue> What is the average rating across all users & movies?</font>

In [26]:
ratings_df.columns


In [27]:
ratings_df['rating'].mean() # also available: .sum(), .std(), .median(), .quantile([.25,.5,.75,.9])


<font color=darkblue> When was the earliest rating given, and also the latest?</font>

In [28]:
ratings_df['timestamp'].min()


In [29]:
ratings_df['timestamp'].max()


<font color=darkblue> Sort data by recency of rating, i.e., latest review at the top</font>

In [30]:
ratings_df.sort_values(by=['timestamp'],ascending=False) # by default ascending=True & inplace=False (so won't change the df)

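
<font color=darkblue> A minimal sketch, on made-up dates, showing that .sort_values() returns a new sorted DataFrame and leaves the original order alone unless the result is assigned back. </font>

```python
import pandas as pd

# Hypothetical toy ratings with invented timestamps
toy_ratings = pd.DataFrame({
    "movieId": [10, 20, 30],
    "timestamp": pd.to_datetime(["2015-03-01", "2018-07-15", "2011-01-20"]),
})

newest_first = toy_ratings.sort_values(by="timestamp", ascending=False)

print(newest_first["movieId"].tolist())  # [20, 10, 30]
print(toy_ratings["movieId"].tolist())   # [10, 20, 30], original order untouched
```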

<font color=darkblue> Which movie received the most ratings? </font>

In [31]:
ratings_df['movieId'].value_counts().head(1)
# value_counts() already sorts in descending order, so the first row is the most-rated movie.

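
<font color=darkblue> If only the id and its count are needed rather than a one-row Series, .idxmax() and .max() on the counts are one possible alternative. A sketch on made-up ratings: </font>

```python
import pandas as pd

# Hypothetical toy ratings: movieId 356 appears three times
toy_ratings = pd.DataFrame({"movieId": [356, 318, 356, 296, 356, 318]})

counts = toy_ratings["movieId"].value_counts()
top_movie = counts.idxmax()  # index label (the movieId) with the highest count
top_count = counts.max()     # the highest count itself

print(top_movie, top_count)  # 356 3
```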

In [32]:
# Slice the dataset to pull all reviews for movieId 356
ratings_df[ratings_df['movieId'] == 356]


In [33]:
# Slice the dataset further to retain only reviews received after year 2010
ratings_df[(ratings_df['movieId'] == 356) & (ratings_df['timestamp'].dt.year > 2010)] #dt.year extracts integer year portion


In [None]:
# Slice the dataset further to retain only reviews received after year 2010 and find the frequency of each rating
recent_356 = ratings_df[(ratings_df['movieId'] == 356) & (ratings_df['timestamp'].dt.year > 2010)]
recent_356['rating'].value_counts()

In [None]:
# Slice the dataset further to retain only ratings with a value of 1 or 3
ratings_df[(ratings_df['movieId'] == 356) & 
           (ratings_df['timestamp'].dt.year > 2010) &
           (ratings_df['rating'].isin([1,3]))].reset_index(drop=True)

# .isin() checks if each element in the column/DataFrame is contained in values passed to the method
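
<font color=darkblue> The combined filter can be easier to read when each condition gets a name. The sketch below runs the same three conditions on a made-up frame (all values are assumptions). When conditions are written inline, each comparison needs its own parentheses because & binds more tightly than == and >. </font>

```python
import pandas as pd

# Hypothetical toy ratings
toy_ratings = pd.DataFrame({
    "movieId": [356, 356, 356, 1, 356],
    "rating": [1.0, 3.0, 5.0, 3.0, 3.0],
    "timestamp": pd.to_datetime(
        ["2012-05-01", "2009-08-10", "2015-02-14", "2016-11-30", "2018-01-05"]
    ),
})

is_movie = toy_ratings["movieId"] == 356
is_recent = toy_ratings["timestamp"].dt.year > 2010
is_one_or_three = toy_ratings["rating"].isin([1, 3])

# Each condition is a boolean Series; & combines them element-wise
subset = toy_ratings[is_movie & is_recent & is_one_or_three].reset_index(drop=True)

print(subset["rating"].tolist())                  # [1.0, 3.0]
print(subset["timestamp"].dt.year.tolist())       # [2012, 2018]
```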

<hr style="height:2px; border-width:0; color:pink; background-color:pink">

## Visualizations

<font color=darkblue> Univariate distributions visualization </font>

In [None]:
import matplotlib.pyplot as plt

<font color=darkblue> Histogram</font>

In [None]:
# Plot a histogram of ratings
plt.figure()
ratings_df['rating'].plot.hist(bins=10)

<font color=darkblue> Boxplot</font>

In [None]:
ratings_df[['rating']].boxplot()
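
<font color=darkblue> A minimal sketch that draws both plots side by side on explicit axes, using made-up ratings (an assumption, not ratings.csv). The Agg backend line is only there so the sketch also runs outside a notebook; inside a notebook it can be dropped. </font>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy ratings
toy_ratings = pd.DataFrame({"rating": [0.5, 3.0, 3.0, 3.5, 4.0, 4.0, 4.0, 5.0]})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram of the rating column on the left axes
toy_ratings["rating"].plot.hist(bins=10, ax=ax_hist)
ax_hist.set_xlabel("rating")
ax_hist.set_title("Histogram")

# Boxplot of the same column on the right axes
toy_ratings[["rating"]].boxplot(ax=ax_box)
ax_box.set_title("Boxplot")

fig.tight_layout()
```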

<div class="alert alert-block alert-success">

<b>For a deeper dive:
    <li>For static visualisations, Pandas and Matplotlib, often combined with Seaborn or Plotly, are common choices.</li>
    <li>Dash and Bokeh are popular for interactive visualisations.</li>
    <li>Streamlit appears to be gaining traction in the data apps space.</li>
    </b>
</div>

In [None]:
# # Recap
# pd.read_csv()
# .dtypes
# .info()
# .select_dtypes(include = "object")/.select_dtypes(exclude = "object")
# .describe(percentiles=[.1,.3])
# .quantile([.1,.3])
# len(df)
# .shape
# .value_counts(normalize=True)
# .isna()
# .sum()
# .astype()
# pd.to_datetime(..., unit='s')
# .nunique()
# .unique()/.unique().tolist()
# .columns
# .mean()
# .sum()
# .std()
# .median()
# .min()
# .max()
# .sort_values()
# .dt.year
# .isin()