# Web Scraping TV-Show Data from Various Sources for Data Analysis and Visualization using Python
##### By Wayne Omondi

## Introduction

***Web scraping*** is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating your own datasets for research and learning. The scraping process involves 'downloading', parsing and processing HTML documents from our target pages.<br> The steps to take will be:

- Picking a website and identifying the information to scrape from it based on our objective(s).
- Using the requests library to 'get' the web page(s) locally.
- Inspecting the webpage's HTML source and identifying the tags that contain the information we seek.
- Using Beautiful Soup to parse (break into components) the HTML document and extract the relevant information into a dataframe.
- Cleaning our data and data types, and doing some feature engineering where necessary.
- (Optional) Exporting our scraped datasets into relevant CSV files.
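
As a rough illustration of the parse-and-extract steps above, the sketch below runs Beautiful Soup over a small inline HTML string instead of a downloaded page. The markup, the class names, the episode titles and the `parse_episodes` helper are all made up for this example and do not mirror any real site.

```python
from bs4 import BeautifulSoup
import pandas as pd

# hypothetical markup standing in for a downloaded page
sample_html = """
<div class="episode"><a class="title">Pilot</a><span class="rating">7.8</span></div>
<div class="episode"><a class="title">Second</a><span class="rating">7.6</span></div>
"""

def parse_episodes(html):
    """Parse the (made-up) episode markup into a dataframe."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    # each episode lives in its own <div class="episode"> block
    for div in soup.find_all('div', attrs={'class': 'episode'}):
        rows.append({
            'title': div.find('a', attrs={'class': 'title'}).text.strip(),
            'rating': float(div.find('span', attrs={'class': 'rating'}).text),
        })
    return pd.DataFrame(rows)

episodes_df = parse_episodes(sample_html)
print(episodes_df)
```

On a real page the same pattern applies, but the tags and class names have to be read off the page's HTML source first.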

For this project the target TV show is **Criminal Minds**, one of my personal favourites. In my opinion, the show did 'fall off' in the later seasons, and I'd like to see whether the data supports that, along with how the show performed overall across the seasons it aired (16 in total).<br>
The data for the TV show will come from IMDB and Wikipedia. The IMDB pages provide a summary, rating and vote count for each episode, while the Wikipedia page provides the viewers (in millions) for each episode. We will create a dataframe from each website and then merge them into one dataset with all the data we need for analysis and visualizations.

![!](images/Screenshot2022-10-25212751.png)

![!](images/Screenshot2022-10-25212823.png)

### 1.0: The 'Tools' We Need

In [1]:
!pip install lxml --quiet
!pip install requests --quiet

In [2]:
#get() sends a GET request to the specified url
#bs4 lib for pulling data out of HTML/XML files
#pandas for data processing and data manipulation

from requests import get 
from bs4 import BeautifulSoup 
import pandas as pd 

### 2.0: Scraping Data from the First Website

In [3]:
#our first target website is wikipedia
wiki_url = 'https://en.wikipedia.org/wiki/List_of_Criminal_Minds_episodes' 

#list of criminalminds' episodes on wikipedia. the data we need here is in a table hence read_html() will be a great option

In [4]:
#using read_html() to get the tabular data from the html doc
wiki_html = pd.read_html(wiki_url)

#view the first row of the first table in the html document
wiki_html[0].head(1)

Unnamed: 0_level_0,Season,Episodes,Episodes,Originally aired,Originally aired,Rank,Rating
Unnamed: 0_level_1,Season,Episodes,Episodes.1,First aired,Last aired,Rank,Rating
0,1,22,22,"September 22, 2005","May 10, 2006",27,8.2


In [5]:
#how many tables are in the doc
len(wiki_html) 

20

In [6]:
#iterate through all table elements to view them so we see the ones we want based on their index
for i, t in enumerate(wiki_html): 
    print("***********************************") #a separator between each table element
    
    #show the index and table
    print(i) 
    print(t)

***********************************
0
   Season Episodes               Originally aired                    Rank  \
   Season Episodes Episodes.1         First aired         Last aired Rank   
0       1       22         22  September 22, 2005       May 10, 2006   27   
1       2       23         23  September 20, 2006       May 16, 2007   18   
2       3       20         20  September 26, 2007       May 21, 2008   18   
3       4       26         26  September 24, 2008       May 20, 2009   11   
4       5       23         23  September 23, 2009       May 26, 2010   14   
5       6       24         24  September 22, 2010       May 18, 2011   10   
6       7       24         24  September 21, 2011       May 16, 2012   13   
7       8       24         24  September 26, 2012       May 22, 2013   16   
8       9       24         24  September 25, 2013       May 14, 2014   13   
9      10       23         23     October 1, 2014        May 6, 2015    8   
10     11       22         22  Septemb

Based on the table outputs, we want indices 1 to 15, which cover seasons 1 to 15 of the show.<br> 
While at it, we can see that the table for the last season (15) has a different column name from the others: _'U.S. viewers (millions)'_, while the rest use _'US viewers (millions)'_.
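
One alternative to renaming a single table by its index is to normalise the column labels on every table, so the fix does not depend on which position the odd table happens to have. A minimal sketch with two tiny stand-in dataframes (the values are taken from the notebook's outputs; the `normalise_columns` helper is hypothetical):

```python
import pandas as pd

# stand-ins for two season tables with the differing header
season_1 = pd.DataFrame({'Title': ['"Extreme Aggressor"'],
                         'US viewers (millions)': ['19.57[2]']})
season_15 = pd.DataFrame({'Title': ['"Under the Skin"'],
                          'U.S. viewers (millions)': ['4.82[311]']})

def normalise_columns(df):
    """Return a copy with 'U.S.' written as 'US' in every column label."""
    return df.rename(columns=lambda name: name.replace('U.S.', 'US'))

tables = [normalise_columns(t) for t in (season_1, season_15)]
print([list(t.columns) for t in tables])
```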

In [28]:
#based on the above output season 15 of criminalminds is index 15
wiki_html[15] 

Unnamed: 0,No. overall,No. in season,Title,Directed by,Written by,Original air date,Prod. code,U.S. viewers (millions)
0,315,1,"""Under the Skin""",Nelson McCormick,Christopher Barbour,"January 8, 2020",1416,4.82[311]
1,316,2,"""Awakenings""",Alec Smight,Stephanie Sengupta,"January 8, 2020",1417,4.49[311]
2,317,3,"""Spectator Slowing""",Kevin Berlandi,Bruce Zimmerman,"January 15, 2020",1418,4.58[312]
3,318,4,"""Saturday""",Edward Allen Bernero,Stephanie Birkitt & Breen Frazier,"January 22, 2020",1419,4.49[313]
4,319,5,"""Ghost""",Diana Valentine,Bobby Chacon & Jim Clemente,"January 29, 2020",1421,5.88[314]
5,320,6,"""Date Night""",Marcus Stokes,Breen Frazier,"February 5, 2020",1420,4.35[315]
6,321,7,"""Rusty""",Rachel Feldman,Erica Meredith & Erik Stiller,"February 5, 2020",1422,3.74[315]
7,322,8,"""Family Tree""",Alec Smight,Bruce Zimmerman,"February 12, 2020",1423,3.94[316]
8,323,9,"""Face Off""",Sharat Raju,Christopher Barbour,"February 19, 2020",1424,5.46[317]
9,324,10,"""And in the End""",Glenn Kershaw,Erica Messer & Kirsten Vangsness,"February 19, 2020",1425,5.36[317]


In [42]:
# rename the column and make change permanent in the dataframe
wiki_html[15].rename(columns={'U.S. viewers (millions)':'US viewers (millions)'}, inplace=True)

In [43]:
# iterate through the tables we need and append them
# empty list for our resulting data
cm_wiki_data = []

# seasons 1 to 15 sit at indices 1 to 15; range(1, 16) stops before 16
for i in range(1,16):
    cm_wiki_data.append(wiki_html[i])

In [54]:
cm_wiki_df = pd.concat(cm_wiki_data)
cm_wiki_df

Unnamed: 0,No. overall,No. in season,Title,Directed by,Written by,Original air date,Prod. code,US viewers (millions)
0,1,1,"""Extreme Aggressor""",Richard Shepard,Jeff Davis,"September 22, 2005",101,19.57[2]
1,2,2,"""Compulsion""",Charles Haid,Jeff Davis,"September 28, 2005",102,10.57[3]
2,3,3,"""Won't Get Fooled Again""",Kevin Bray,Aaron Zelman,"October 5, 2005",103,11.98[4]
3,4,4,"""Plain Sight""",Matt Earl Beesley,Edward Allen Bernero,"October 12, 2005",104,13.76[5]
4,5,5,"""Broken Mirror""",Guy Norman Bee,Judith McCreary,"October 19, 2005",105,12.79[6]
...,...,...,...,...,...,...,...,...
5,320,6,"""Date Night""",Marcus Stokes,Breen Frazier,"February 5, 2020",1420,4.35[315]
6,321,7,"""Rusty""",Rachel Feldman,Erica Meredith & Erik Stiller,"February 5, 2020",1422,3.74[315]
7,322,8,"""Family Tree""",Alec Smight,Bruce Zimmerman,"February 12, 2020",1423,3.94[316]
8,323,9,"""Face Off""",Sharat Raju,Christopher Barbour,"February 19, 2020",1424,5.46[317]


We now have all the relevant data from the wikipedia page compiled into a single dataset. 
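
Note that `pd.concat` keeps each table's own row labels, so the combined index restarts at 0 for every season. If a continuous index is preferred, passing `ignore_index=True` is one option. The sketch below uses two tiny stand-in frames rather than the scraped tables:

```python
import pandas as pd

# two small stand-ins for consecutive season tables
part_a = pd.DataFrame({'No. overall': [1, 2]})
part_b = pd.DataFrame({'No. overall': [3, 4]})

# default behaviour: the original row labels are kept and therefore repeat
kept = pd.concat([part_a, part_b])

# ignore_index=True discards the old labels and numbers the rows from 0
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(list(kept.index), list(renumbered.index))
```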

#### 2.1: Cleaning the Wikipedia Data

In [55]:
#info on the combined dataframe (seasons 1 to 15)
cm_wiki_df.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 324 entries, 0 to 9
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   No. overall            324 non-null    int64 
 1   No. in season          324 non-null    int64 
 2   Title                  324 non-null    object
 3   Directed by            324 non-null    object
 4   Written by             324 non-null    object
 5   Original air date      324 non-null    object
 6   Prod. code             324 non-null    int64 
 7   US viewers (millions)  324 non-null    object
dtypes: int64(3), object(5)
memory usage: 22.8+ KB


From the dataframe we can already see some columns that will need to be dropped or cleaned, and some that will need to be assigned new datatypes, for example _'US viewers (millions)'_ and _'Original air date'_.<br>
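
A possible way to do that cleaning is sketched below on a two-row stand-in frame whose values are copied from the outputs above. It assumes the viewer figures always end in a footnote marker such as `[2]` and that the dates are always in the long form shown; the real scraped columns may need extra handling.

```python
import pandas as pd

# stand-in rows copied from the notebook's outputs
sample = pd.DataFrame({
    'Original air date': ['September 22, 2005', 'January 8, 2020'],
    'US viewers (millions)': ['19.57[2]', '4.82[311]'],
})

# drop the footnote marker such as "[2]", then convert the text to floats
viewers = sample['US viewers (millions)'].str.replace(r'\[\d+\]', '', regex=True)
sample['US viewers (millions)'] = pd.to_numeric(viewers)

# parse the long-form dates into datetime64 values
sample['Original air date'] = pd.to_datetime(sample['Original air date'],
                                             format='%B %d, %Y')

print(sample.dtypes)
```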

### 3.0: Scraping Data from IMDB

In [11]:
# list that will compose our dataframe
season_number_lst = []
episode_number_lst = []
episode_name_lst = []
episode_description_lst = []
episode_airdate_lst = []
imdb_rating_lst = []
imdb_votes_lst = []

In [12]:
# retrieving the html documents for each season's page from imdb
# criminal minds has 15 seasons
for season in range(15):
    season_number = season + 1
#    print(f'Extracting Data for Season {season_number}')
    imdb_url = f'https://www.imdb.com/title/tt0452046/episodes?season={season_number}' #each season has its own page, hence the 'season=' parameter with our variable
    imdb_response = get(imdb_url)
    
#   imdb_response.status_code - 200 means the request succeeded

    season_html = BeautifulSoup(imdb_response.content, 'lxml')  # name the parser explicitly; lxml was installed above
    season_info = season_html.find_all('div', attrs={'class':'info'})
    
# retrieving only the relevant data from each season's retrieved html doc
    for episode_number, episode in enumerate(season_info):
        episode_name = episode.strong.a.text
#        print(f'episode name: {episode_name}')
        
        episode_description = episode.find(attrs={'class':'item_description'}).text.strip()
#        print(f'summary: {episode_description}')
        
        episode_airdate = episode.find(attrs={'class':'airdate'}).text.strip()
#        print(f'episode aired on: {episode_airdate}')
        
        imdb_rating = episode.find(attrs={'class':'ipl-rating-star__rating'}).text
#        print(f'imdb rating: {imdb_rating}')
        
        imdb_votes = episode.find(attrs={'class':'ipl-rating-star__total-votes'}).text.strip('()')
#        print(f'votes on imdb: {imdb_votes}')
        
#        print(f'\n')
        
        season_number_lst.append(season_number)
        episode_number_lst.append(episode_number + 1)
        episode_name_lst.append(episode_name)
        episode_description_lst.append(episode_description)
        episode_airdate_lst.append(episode_airdate)
        imdb_rating_lst.append(imdb_rating)
        imdb_votes_lst.append(imdb_votes)

In [13]:
cm_imdb_df = pd.DataFrame({
        'season_number':season_number_lst,
        'episode_number':episode_number_lst,
        'episode_name':episode_name_lst,
        'episode_description':episode_description_lst,
        'imdb_rating':imdb_rating_lst,
        'imdb_votes':imdb_votes_lst
})

In [14]:
cm_imdb_df

Unnamed: 0,season_number,episode_number,episode_name,episode_description,imdb_rating,imdb_votes
0,1,1,Extreme Aggressor,[\nThe team travels to Seattle to find the cap...,7.8,2964
1,1,2,Compulsion,[\nThe team are called to an Arizona college w...,7.6,2479
2,1,3,Won't Get Fooled Again,[\nA bomber in Palm Beach forces Gideon to con...,7.6,2455
3,1,4,Plain Sight,[\nA serial rapist-killer active in San Diego ...,7.7,2234
4,1,5,Broken Mirror,[\nOne of a lawyer's identical twin daughters ...,7.9,2317
...,...,...,...,...,...,...
318,15,6,Date Night,"[\nAfter a father and daughter get kidnapped, ...",8.6,841
319,15,7,Rusty,[\nWhen the BAU team travels to Denver to inve...,7.1,549
320,15,8,Family Tree,[\nPrentiss and J.J. decide about their future...,7.3,523
321,15,9,Face Off,[\nIt has been a year since Rossi nearly died ...,8.2,566


#### 3.1: Cleaning the IMDB Data

In [16]:
# checking out data types
cm_imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   season_number        323 non-null    int64 
 1   episode_number       323 non-null    int64 
 2   episode_name         323 non-null    object
 3   episode_description  323 non-null    object
 4   imdb_rating          323 non-null    object
 5   imdb_votes           323 non-null    object
dtypes: int64(2), object(4)
memory usage: 15.3+ KB


In [17]:
# remove the "," in 'imdb_votes' 
cm_imdb_df.imdb_votes = cm_imdb_df['imdb_votes'].str.replace(",","")

# converting the column data type to allow math operations
cm_imdb_df.imdb_votes = pd.to_numeric(cm_imdb_df.imdb_votes)

# converting the ratings column as well
cm_imdb_df.imdb_rating = pd.to_numeric(cm_imdb_df.imdb_rating)

In [20]:
cm_imdb_df.dtypes

season_number            int64
episode_number           int64
episode_name            object
episode_description     object
imdb_rating            float64
imdb_votes               int64
dtype: object

Our IMDB dataframe has 323 episodes, while the Wikipedia dataframe has 324 episodes.

In [79]:
# let's check the number of episodes per season in the imdb df
cm_imdb_df.groupby(['season_number'])['episode_number'].count()

season_number
1     22
2     23
3     20
4     25
5     23
6     24
7     24
8     24
9     24
10    23
11    22
12    22
13    22
14    15
15    10
Name: episode_number, dtype: int64

In [81]:
# first table with seasons overview on wikipedia
wiki_html[0]

Unnamed: 0_level_0,Season,Episodes,Episodes,Originally aired,Originally aired,Rank,Rating
Unnamed: 0_level_1,Season,Episodes,Episodes.1,First aired,Last aired,Rank,Rating
0,1,22,22,"September 22, 2005","May 10, 2006",27,8.2
1,2,23,23,"September 20, 2006","May 16, 2007",18,8.8
2,3,20,20,"September 26, 2007","May 21, 2008",18,8.2
3,4,26,26,"September 24, 2008","May 20, 2009",11,9.4
4,5,23,23,"September 23, 2009","May 26, 2010",14,8.5
5,6,24,24,"September 22, 2010","May 18, 2011",10,8.7
6,7,24,24,"September 21, 2011","May 16, 2012",13,8.6
7,8,24,24,"September 26, 2012","May 22, 2013",16,8.0
8,9,24,24,"September 25, 2013","May 14, 2014",13,8.2
9,10,23,23,"October 1, 2014","May 6, 2015",8,9.0


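Season 4 has 25 episodes in the IMDB dataframe but 26 in the Wikipedia overview table, which may account for the one-episode difference between the two dataframes.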
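
One way to narrow down such a gap is to put the per-season episode counts from both sources side by side and keep only the seasons where they differ. The sketch below uses small stand-in Series; the counts for seasons 3 to 5 are the ones shown in the outputs above, and the variable names are assumptions.

```python
import pandas as pd

# episode counts per season, copied from the outputs above (seasons 3 to 5 only)
imdb_counts = pd.Series({3: 20, 4: 25, 5: 23}, name='imdb')
wiki_counts = pd.Series({3: 20, 4: 26, 5: 23}, name='wikipedia')

# align the two Series on the season number, one column per source
comparison = pd.concat([imdb_counts, wiki_counts], axis=1)

# keep only the seasons where the two sources disagree
mismatched = comparison[comparison['imdb'] != comparison['wikipedia']]
print(mismatched)
```

With the full dataframes, the IMDB side could come from the `groupby` count shown earlier, and the Wikipedia side from a similar count on the combined Wikipedia dataframe.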