# Web Scraping TV Show Data from Various Sources for Data Analysis and Visualization using Python
##### By Wayne Omondi

### Introduction

***Web scraping*** is the process of extracting and parsing data from websites. It's a useful technique for creating your own datasets for research and learning. The scraping process involves 'downloading', parsing and processing HTML documents from our target pages.<br> The steps to take will be:

- Picking a website and identifying the information to scrape from it, based on our objective(s).
- Using the requests library to 'get' the web page(s) locally.
- Inspecting the webpage's HTML source to find the tags that contain the information we seek.
- Using Beautiful Soup to parse the HTML document (break it into components) and extract the relevant information into a dataframe.
- Cleaning our data and data types, with some feature engineering where necessary.
- (Optional) Exporting our scraped datasets to CSV files.
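
The steps above can be sketched end to end on a tiny example. This is a minimal illustration only: the HTML string, the class names (`episode`, `title`, `rating`) and the variable names are made up for the sketch and do not come from IMDB or Wikipedia. In practice the HTML would come from a `requests` call rather than a string.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a downloaded page; with a real site
# this text would come from requests.get(url).text
sample_html = """
<div class="episode"><a class="title">Pilot</a><span class="rating">8.0</span></div>
<div class="episode"><a class="title">Second Episode</a><span class="rating">7.5</span></div>
"""

# Parse the document into a tree we can search
soup = BeautifulSoup(sample_html, "html.parser")

# Pull the fields we care about out of each episode block
rows = []
for block in soup.find_all("div", class_="episode"):
    rows.append({
        "title": block.find("a", class_="title").get_text(strip=True),
        "rating": block.find("span", class_="rating").get_text(strip=True),
    })

# Build the dataframe and fix the data type of the numeric column
episodes_df = pd.DataFrame(rows)
episodes_df["rating"] = pd.to_numeric(episodes_df["rating"])

# Optional export step:
# episodes_df.to_csv("episodes.csv", index=False)
```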

For this project the target TV show is **Criminal Minds**, one of my personal favourites. In my opinion, the show did 'fall off' in the later seasons, and I'd like to see whether the data supports that, along with how the show performed overall across the seasons it aired (16 in total).<br>

![!](images/Screenshot2022-10-25212751.png)

![!](images/Screenshot2022-10-25212823.png)

The data for the TV show will come from IMDB and Wikipedia. IMDB will provide a summary, rating and vote count for each episode, while the Wikipedia page will provide the viewers (in millions) for each episode. We will create a dataframe from each website and then merge them into one dataset with all the data we need for analysis and visualizations.

### 1.0: Libraries/Tools 

In [1]:
!pip install lxml --quiet
!pip install requests --quiet

In [2]:
#get() sends a GET request to the specified url
#bs4 lib for pulling data out of HTML/XML files
#pandas for data processing and data manipulation

from requests import get 
from bs4 import BeautifulSoup 
import pandas as pd 

### 2.0: Data Collection

#### 2.1: Scraping Data from Wikipedia

In [3]:
#our first target website is wikipedia
wiki_url = 'https://en.wikipedia.org/wiki/List_of_Criminal_Minds_episodes' 

#list of criminalminds' episodes on wikipedia. the data we need here is in a table hence read_html() will be a great option

In [5]:
#using read_html() method to get the tabular data for the html doc
wiki_html = pd.read_html(wiki_url)

#view the first row of the first table in the html document
wiki_html[0].head(1) 



In [None]:
#how many tables are in the doc
len(wiki_html) 

In [None]:
#iterate through all table elements to view them so we see the ones we want based on their index
for i, t in enumerate(wiki_html): 
    print("***********************************") #a separator between each table element
    
    #show the index and table
    print(i) 
    print(t)

Based on the table outputs, we want indices 1 to 15, which should cover seasons 1 to 15 of the show.<br> 
While at it, we can see that the table for the last season (15) has a different column name from the others: _'U.S. viewers (millions)'_, while the rest use 'US viewers (millions)'.
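
As an alternative to renaming one table by hand, the column names could be normalised on every table before concatenating. The sketch below is only an illustration: `season_a`, `season_b` and `normalise_columns` are hypothetical names, and the two small frames stand in for the real season tables.

```python
import pandas as pd

# Hypothetical stand-ins for two of the season tables
season_a = pd.DataFrame({"Title": ['"A"'], "US viewers (millions)": ["1.00"]})
season_b = pd.DataFrame({"Title": ['"B"'], "U.S. viewers (millions)": ["2.00"]})

def normalise_columns(table):
    """Return a copy with 'U.S.' spelled as 'US' in every column name."""
    return table.rename(columns=lambda name: name.replace("U.S.", "US"))

# With matching names, concat stacks the viewers into a single column
combined = pd.concat(
    [normalise_columns(t) for t in (season_a, season_b)],
    ignore_index=True,
)
```

Without the normalisation step, `pd.concat` would keep both spellings as separate columns and fill the gaps with NaN.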

In [None]:
#based on the above output season 15 of criminalminds is index 15
wiki_html[15] 

In [None]:
#rename the column and make change permanent in the dataframe
wiki_html[15].rename(columns={
    'U.S. viewers (millions)':'US viewers (millions)'}, inplace=True)

In [None]:
#iterate through the tables we need and append them
#empty list for our resulting data
cm_wiki_data = []

#range from season 1 to 15 (indices 1, 15)
for i in range(1,16):
    cm_wiki_data.append(wiki_html[i])

In [None]:
cm_wiki_df = pd.concat(cm_wiki_data)
cm_wiki_df

We now have all the relevant data from the wikipedia page compiled into a single dataset. 

#### 2.2: Scraping Data from IMDB

In [None]:
#lists that will make up our dataframe
season_number_lst = []
episode_number_lst = []
episode_title_lst = []
episode_description_lst = []
imdb_rating_lst = []
imdb_votes_lst = []

In [None]:
#retrieving the html documents for each season's page from imdb
#criminal minds has 15 seasons
for season in range(15):
    season_number = season + 1
    #print(f'--Extracting Data for Season {season_number}')
    imdb_url = f'https://www.imdb.com/title/tt0452046/episodes?season={season_number}' 
    
    #each season has its own page, hence the 'season=' parameter with our variable
    imdb_response = get(imdb_url)
    
    #imdb_response.status_code of 200 means the request succeeded

    #name the parser explicitly (lxml was installed above) to avoid a bs4 warning
    season_html = BeautifulSoup(imdb_response.content, 'lxml')
    season_info = season_html.find_all('div', attrs={
        'class':'info'})
    
    #retrieving only the relevant data from each season's retrieved html doc
    for episode_number, episode in enumerate(season_info):
        episode_title = episode.strong.a.text
        #print(f'episode title: {episode_title}')
        
        episode_description = episode.find(attrs={
            'class':'item_description'}).text
        #print(f'summary: {episode_description}')
        
        imdb_rating = episode.find(attrs={
            'class':'ipl-rating-star__rating'}).text
        #print(f'imdb rating: {imdb_rating}')
        
        imdb_votes = episode.find(attrs={
            'class':'ipl-rating-star__total-votes'}).text
        #print(f'votes on imdb: {imdb_votes}')
        
        #print(f'\n')
        
        season_number_lst.append(season_number)
        episode_number_lst.append(episode_number + 1)
        episode_title_lst.append(episode_title)
        episode_description_lst.append(episode_description)
        imdb_rating_lst.append(imdb_rating)
        imdb_votes_lst.append(imdb_votes)
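
One thing to watch in the loop above: `find()` returns `None` when an element is missing (for example, an episode with no rating yet), and calling `.text` on `None` raises an `AttributeError`. A small helper can guard against that. This is a sketch only: `text_or_none` is a hypothetical helper, and the sample markup is made up, reusing just the class names from the cell above.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, css_class):
    """Return the stripped text of the first descendant with css_class, or None."""
    element = parent.find(attrs={"class": css_class})
    return element.get_text(strip=True) if element is not None else None

# Hypothetical markup: the second 'info' block has no rating element
sample = BeautifulSoup(
    '<div class="info"><span class="ipl-rating-star__rating">8.1</span></div>'
    '<div class="info"></div>',
    "html.parser",
)

ratings = [
    text_or_none(div, "ipl-rating-star__rating")
    for div in sample.find_all("div", attrs={"class": "info"})
]
```

Missing values then show up as `None` in the list (and as NaN after `pd.to_numeric`) instead of stopping the scrape partway through a season.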

In [None]:
#create a dataframe using our outputs
cm_imdb_df = pd.DataFrame({
        'season_number':season_number_lst,
        'episode_number':episode_number_lst,
        'episode_title':episode_title_lst,
        'episode_description':episode_description_lst,
        'imdb_rating':imdb_rating_lst,
        'imdb_votes':imdb_votes_lst
})

In [None]:
cm_imdb_df

### 3.0: Data Cleaning

We will use string methods like .strip() and .replace() to remove punctuation marks from values where none is needed.<br>
Drop the column(s) we do not need.<br>
Convert string features into numeric features, for the features that we will need for calculations.
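
A minimal sketch of these cleaning steps on made-up values. The frame below is hypothetical: the titles and numbers are placeholders shaped like the scraped columns (quoted titles, bracketed vote counts with commas, viewer figures with a footnote marker), not real data from either site.

```python
import pandas as pd

# Hypothetical raw values in the shapes the scraped columns take
raw = pd.DataFrame({
    "title": ['"Example One"', '"Example Two"'],
    "votes": ["(4,921)", "(4,102)"],
    "viewers": ["19.57[2]", "9.81[27]"],
})

clean = pd.DataFrame({
    # strip the surrounding quotation marks
    "title": raw["title"].str.strip('"'),
    # strip the brackets, drop the thousands separator, then cast to int
    "votes": raw["votes"].str.strip("()").str.replace(",", "", regex=False).astype(int),
    # remove a trailing footnote marker such as [2] before converting
    "viewers": pd.to_numeric(raw["viewers"].str.replace(r"\[.*\]$", "", regex=True)),
})
```

The regular expression here is a swapped-in alternative to slicing a fixed number of characters; it does not depend on how many digits the viewer figure has.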

#### 3.1: Cleaning the Wikipedia Data

In [None]:
#summary info on the combined wikipedia dataframe
cm_wiki_df.info() 

In [None]:
cm_wiki_df.drop(columns = 'Prod. code', inplace=True)

In [None]:
#remove quotation marks for the Titles
cm_wiki_df.Title = cm_wiki_df['Title'].str.strip('""')

In [None]:
#clean 'US viewers (millions)' column
#keep the first 5 characters (cuts off the footnote marker), then strip any leftover '['
cm_wiki_df['US viewers (millions)'] = [x[:5] for x in cm_wiki_df['US viewers (millions)']]
cm_wiki_df['US viewers (millions)'] = cm_wiki_df['US viewers (millions)'].str.strip('[')

cm_wiki_df['US viewers (millions)'] = pd.to_numeric(cm_wiki_df['US viewers (millions)'])

In [None]:
cm_wiki_df.dtypes

In [None]:
cm_wiki_df.columns

In [None]:
cm_wiki_df.rename(columns={
    "Title": "episode_title",
    "No. overall":"episode_number_overall",
    "No. in season":"episode_number",
    "Directed by":"episode_director",
    "Written by":"episode_writer",
    "Original air date":"episode_airdate",
    "US viewers (millions)":"us_viewers_in_millions"
}, inplace=True)

#### 3.2: Cleaning the IMDB Data

In [None]:
#checking out data types
cm_imdb_df.info()

In [None]:
#clean up the string features
cm_imdb_df.episode_description = cm_imdb_df['episode_description'].str.strip()

In [None]:
#convert data types 
cm_imdb_df['imdb_votes'] = cm_imdb_df['imdb_votes'].str.strip('()').str.replace(",", "").astype(int)

cm_imdb_df.imdb_rating = pd.to_numeric(cm_imdb_df.imdb_rating)

In [None]:
cm_imdb_df

In [None]:
cm_imdb_df.dtypes

Our IMDB dataframe has 323 episodes, while the Wikipedia dataframe has 324 episodes. 

In [None]:
#lets check the number of episodes per season in the imdb df
cm_imdb_df.groupby(['season_number'])['episode_number'].count()

In [None]:
#first table with seasons overview on wikipedia
wiki_html[0]

Season 4 has 25 episodes in the IMDB dataframe and 26 in the Wikipedia one.
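
This kind of mismatch can also be found programmatically by putting the per-season counts side by side. The sketch below assumes both frames carry a `season_number` column; the Wikipedia dataframe in this notebook does not have one at this point, so it would need to be added first. `imdb_like` and `wiki_like` are hypothetical stand-ins with made-up rows.

```python
import pandas as pd

# Hypothetical per-episode frames; only the season column matters here
imdb_like = pd.DataFrame({"season_number": [1, 1, 2, 2]})
wiki_like = pd.DataFrame({"season_number": [1, 1, 2, 2, 2]})

# Episode counts per season from each source, aligned on season number
counts = pd.DataFrame({
    "imdb": imdb_like.groupby("season_number").size(),
    "wiki": wiki_like.groupby("season_number").size(),
})

# Keep only the seasons where the two sources disagree
mismatched = counts[counts["imdb"] != counts["wiki"]]
```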

In [None]:
wiki_html[4]

In [None]:
#view season 4 in the imdb df
cm_imdb_df[cm_imdb_df['season_number']==4]

As a fan of the show I remember that in the original airing, episodes 25 & 26 of season 4 were aired as one 2-hour-long episode, "To Hell...And Back". On Wikipedia it is split into two separate episodes (25. To Hell & 26. And Back), while IMDB regards it as one.

In [None]:
#delete the row that contains the second part of the split episode in the wikipedia df
#(the Title column was renamed to episode_title above)
cm_wiki_df = cm_wiki_df[~cm_wiki_df.episode_title.str.contains("And Back")]

In [None]:
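#then change the title of season 4 episode 25 in the wikipedia df to the combined title
#note: Series.replace() matches whole values, so this assumes the title is exactly "To Hell"
cm_wiki_df['episode_title'] = cm_wiki_df['episode_title'].replace("To Hell","To Hell... And Back")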

### 4.0: Combining Our Dataframes Into One

In [None]:
cm_imdb_df.head(2)

In [None]:
cm_wiki_df.head(2)