# Data Munging: IMDB Website Data

Data munging (or data wrangling) means preparing your data for a dedicated purpose: taking the data from its raw state and transforming and mapping it into another format, normally for use beyond its original intent.

This notebook expects an IMDB.csv file created from web scraping.

In [1]:
import os
os.path.isfile('IMDB.csv')

True

In [2]:
import pandas as pd
df = pd.read_csv('IMDB.csv', encoding='latin-1')
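
The file check and the read in the first two cells can be combined so that a missing file fails early with a clear message. This is a minimal sketch, not part of the original notebook; the helper name `load_imdb` is hypothetical, and it assumes the same file name and encoding used above.

```python
from pathlib import Path

import pandas as pd


def load_imdb(path="IMDB.csv"):
    """Read the scraped CSV, failing early if the file is missing.

    `load_imdb` is a hypothetical helper, not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected scraped data at {csv_path.resolve()}")
    # Same encoding as the read_csv call in the notebook.
    return pd.read_csv(csv_path, encoding="latin-1")
```

Calling `df = load_imdb()` would then replace the separate `os.path.isfile` check and `pd.read_csv` call.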

In [3]:
print(df[0:2])

   Rank                    Title                     Genre  \
0     1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1     2               Prometheus  Adventure,Mystery,Sci-Fi   

                                         Description      Director  \
0  A group of intergalactic criminals are forced ...    James Gunn   
1  Following clues to the origin of mankind, a te...  Ridley Scott   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121   
1  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012                124   

   Rating   Votes  Revenue (Millions)  Metascore  
0     8.1  757074                 333         76  
1     7.0  485820                 126         65  


In [4]:
df.head(3)

| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333 | 76 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126 | 65 |
| 2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138 | 62 |


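Notice that df.head(3) also formats the result as a table: a DataFrame left as the last expression of a notebook cell is rendered as a table, unlike the wrapped plain-text output of print(df[0:2]).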

In [5]:
df.describe()

| | Rank | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 2012.783 | 113.172 | 6.7232 | 169808.3 | 80.029 | 59.818 |
| std | 288.819436 | 3.205962 | 18.810908 | 0.945429 | 188762.6 | 96.638093 | 16.937407 |
| min | 1.0 | 2006.0 | 66.0 | 1.9 | 61.0 | 0.0 | 11.0 |
| 25% | 250.75 | 2010.0 | 100.0 | 6.2 | 36309.0 | 17.0 | 47.75 |
| 50% | 500.5 | 2014.0 | 111.0 | 6.8 | 110799.0 | 60.0 | 61.0 |
| 75% | 750.25 | 2016.0 | 123.0 | 7.4 | 239909.8 | 98.5 | 72.0 |
| max | 1000.0 | 2016.0 | 191.0 | 9.0 | 1791916.0 | 936.0 | 100.0 |


In [7]:
df.dtypes

Rank                    int64
Title                  object
Genre                  object
Description            object
Director               object
Actors                 object
Year                    int64
Runtime (Minutes)       int64
Rating                float64
Votes                   int64
Revenue (Millions)      int64
Metascore               int64
dtype: object

In [8]:
df.isnull().sum()

Rank                  0
Title                 0
Genre                 0
Description           0
Director              0
Actors                0
Year                  0
Runtime (Minutes)     0
Rating                0
Votes                 0
Revenue (Millions)    0
Metascore             0
dtype: int64
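
Every column reports zero missing values here. For contrast, a minimal sketch on a small made-up frame shows what the same call reports when cells are missing. The titles and values below are hypothetical and do not come from IMDB.csv.

```python
import pandas as pd

# Hypothetical sample with two missing cells (not taken from IMDB.csv).
sample = pd.DataFrame(
    {
        "Title": ["Film A", "Film B", "Film C"],
        "Revenue (Millions)": [333.0, None, 138.0],
        "Metascore": [76.0, 65.0, None],
    }
)

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_gaps = sample[sample.isnull().any(axis=1)]
print(rows_with_gaps)
```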

In [9]:
genre = df['Genre']  # NB: column labels are case-sensitive, so df['genre'] raises a KeyError

In [10]:
genre.head()

0     Action,Adventure,Sci-Fi
1    Adventure,Mystery,Sci-Fi
2             Horror,Thriller
3     Animation,Comedy,Family
4    Action,Adventure,Fantasy
Name: Genre, dtype: object
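
The note about df['genre'] comes down to column labels being matched exactly, including case. A small sketch on a two-row sample (the frame here is a stand-in, not the full dataset) shows the failure and one possible way to check a label before using it.

```python
import pandas as pd

# Stand-in frame with the same column label as the notebook's data.
sample = pd.DataFrame({"Genre": ["Action,Adventure,Sci-Fi", "Horror,Thriller"]})

try:
    sample["genre"]  # lowercase label does not exist
except KeyError as err:
    print("KeyError:", err)

# Membership tests on the column index are also case-sensitive.
has_lower = "genre" in sample.columns
has_exact = "Genre" in sample.columns
print(has_lower, has_exact)
```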

In [11]:
genre_s = genre.str.split(',', expand=True)

In [12]:
genre_s.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Action | Adventure | Sci-Fi |
| 1 | Adventure | Mystery | Sci-Fi |
| 2 | Horror | Thriller | None |
| 3 | Animation | Comedy | Family |
| 4 | Action | Adventure | Fantasy |
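
With expand=True, str.split returns one column per piece, and rows with fewer pieces are padded with missing values (row 2 above has only two genres). A minimal sketch on two of the genre strings shown earlier:

```python
import pandas as pd

sample = pd.Series(["Action,Adventure,Sci-Fi", "Horror,Thriller"], name="Genre")

# expand=True spreads the pieces across columns 0, 1, 2.
parts = sample.str.split(",", expand=True)
print(parts)

# Count the real (non-missing) genres in each row.
n_genres = parts.notna().sum(axis=1)
print(n_genres.tolist())
```

The padded cells matter in the next step, because any loop over all cells will see them as well.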


In [13]:
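genres = {}
# Count how often each genre appears across the split columns.
# Rows with fewer than three genres are padded with None, so None is counted too.
for index, row in genre_s.iterrows():
    for c in row:
        if c in genres:
            genres[c] += 1
        else:
            genres[c] = 1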

In [14]:
genres

{'Action': 303,
 'Adventure': 259,
 'Sci-Fi': 120,
 'Mystery': 106,
 'Horror': 119,
 'Thriller': 195,
 None: 445,
 'Animation': 49,
 'Comedy': 279,
 'Family': 51,
 'Fantasy': 101,
 'Drama': 513,
 'Music': 16,
 'Biography': 81,
 'Romance': 141,
 'History': 29,
 'Crime': 150,
 'Western': 7,
 'War': 13,
 'Musical': 5,
 'Sport': 18}
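
The None: 445 entry is not a genre: it counts the padding cells from rows that list fewer than three genres. As an alternative sketch (not the notebook's method), splitting into lists and using explode with value_counts gives the same kind of tally without a padding entry. The three-row sample below is made up for illustration.

```python
import pandas as pd

# Made-up sample in the same comma-separated format as the Genre column.
sample = pd.Series(
    ["Action,Adventure,Sci-Fi", "Horror,Thriller", "Action,Horror"],
    name="Genre",
)

# Split each string into a list, put every genre on its own row, then count.
counts = sample.str.split(",").explode().value_counts()
print(counts)
```

Because no padding is created, there is no None key to remove afterwards.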

In [15]:
genres_df = pd.DataFrame.from_dict(genres, orient='index')

In [16]:
genres_df.head()

| | 0 |
|---|---|
| Action | 303 |
| Adventure | 259 |
| Sci-Fi | 120 |
| Mystery | 106 |
| Horror | 119 |
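
The single column is labeled 0 because no column name was supplied. One possible refinement, sketched here on the first three counts from the output above, is to pass a column name to from_dict and name the index. The labels "Count" and "Genre" are assumptions chosen for illustration.

```python
import pandas as pd

# First three counts from the genres dictionary shown earlier.
sample_counts = {"Action": 303, "Adventure": 259, "Sci-Fi": 120}

# columns= is accepted together with orient="index".
counts_df = pd.DataFrame.from_dict(sample_counts, orient="index", columns=["Count"])
counts_df.index.name = "Genre"

# Most frequent genre first.
counts_df = counts_df.sort_values("Count", ascending=False)
print(counts_df)
```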


### Converting the data frame to CSV
Use the pandas to_csv method to write the genre counts to a file.

In [17]:
genres_df.to_csv("IMDB_genres.csv")
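
To check what to_csv writes, the frame can be round-tripped through an in-memory buffer instead of a file. This is a sketch on two of the counts above; the index_label value "Genre" and the column name "Count" are assumptions for illustration.

```python
import io

import pandas as pd

counts_df = pd.DataFrame({"Count": [303, 259]}, index=["Action", "Adventure"])

# Write to an in-memory text buffer; index_label names the index column.
buffer = io.StringIO()
counts_df.to_csv(buffer, index_label="Genre")
csv_text = buffer.getvalue()
print(csv_text)

# Read it back, restoring the genre names as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Genre")
print(restored)
```

Without index_label, the index column is written with an empty header, which read_csv later reports as "Unnamed: 0".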