# Web Scraping TV-Show Data from Various Sources for Data Analysis and Visualization using Python
##### By Wayne Omondi

## Introduction

***Web scraping*** is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating your own datasets for research and learning. The scraping process involves 'downloading', parsing and processing HTML documents from our target pages.<br> The steps to take will be:

- Picking a website and identifying the information to scrape from it based on our objective(s).
- Using the requests library to 'get' the web page(s) locally.
- Inspecting the webpage's HTML source to find the tags that contain the information we seek.
- Using Beautiful Soup to parse the HTML document (break it into components) and extract the relevant information into a dataframe.
- Cleaning our data and data types, and doing some feature engineering where necessary.
- (Optional) Exporting our scraped datasets into relevant CSV files.

For this project the target TV show is **Criminal Minds**, one of my personal favourites. In my opinion, the show did 'fall off' in the later seasons, and I'd like to see whether the data supports that, along with how the show performed overall across the seasons it aired (16 in total).<br>
The data for the TV show will come from IMDB and Wikipedia. IMDB provides a summary, rating and vote count for each episode, while the Wikipedia page provides the viewers (in millions) for each episode. We will create a dataframe from each website and then merge them into one dataset with all the data we need for analysis and visualizations.

![!](images/Screenshot2022-10-25212751.png)

![!](images/Screenshot2022-10-25212823.png)

### 1.0: The 'Tools' We Need

In [None]:
!pip install lxml --quiet
!pip install requests --quiet

In [None]:
#get() sends a GET request to the specified url
#bs4 lib for pulling data out of HTML/XML files
#pandas for data processing and data manipulation

from requests import get 
from bs4 import BeautifulSoup 
import pandas as pd 

### 2.0: Scraping Data from the First Website

In [None]:
#our first target website is wikipedia
wiki_url = 'https://en.wikipedia.org/wiki/List_of_Criminal_Minds_episodes' 

#list of criminalminds' episodes on wikipedia. the data we need here is in a table hence read_html() will be a great option

In [None]:
#using the read_html() function to get the tabular data from the html doc
wiki_html = pd.read_html(wiki_url)

#view the first row of the first table in the html document
wiki_html[0].head(1) 

In [None]:
#how many tables are in the doc
len(wiki_html) 

In [None]:
#iterate through all table elements to view them so we see the ones we want based on their index
for i, t in enumerate(wiki_html): 
    print("***********************************") #a separator between each table element
    
    #show the index and table
    print(i) 
    print(t)

In [None]:
#based on the above output season 1 of criminalminds is index 1
wiki_html[1] 

In [None]:
# iterate through the tables we need and append them
# empty list for our resulting data
cm_wiki_data = []

# seasons 1 to 15 sit at indices 1 through 15
for i in range(1,16):
    cm_wiki_data.append(wiki_html[i])

In [None]:
cm_wiki_df = pd.concat(cm_wiki_data)
cm_wiki_df

We now have all the relevant data from the wikipedia page compiled into a single dataset. 
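One thing a plain concatenation does not record is which season each row came from. Below is a minimal sketch of how a season label could be attached before concatenating. It uses two small made-up tables in place of `wiki_html[1]` to `wiki_html[15]`, and the column name `season` is an assumption for illustration, not something taken from the Wikipedia tables.

```python
import pandas as pd

# toy stand-ins for two season tables (made-up values, not scraped data)
season_tables = [
    pd.DataFrame({"Title": ["Ep A", "Ep B"], "No. in season": [1, 2]}),
    pd.DataFrame({"Title": ["Ep C"], "No. in season": [1]}),
]

tagged = []
for season_number, table in enumerate(season_tables, start=1):
    # assign() returns a copy, so the original tables are left untouched
    tagged.append(table.assign(season=season_number))

# ignore_index=True gives the combined frame a clean 0..n-1 index
combined = pd.concat(tagged, ignore_index=True)
print(combined)
```

Having the season on every row should make it easier to line the Wikipedia data up with the IMDB data later on.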

#### 2.1: Cleaning the Wikipedia Data

In [None]:
#column names, non-null counts and data types of the combined wikipedia dataframe
cm_wiki_df.info() 

From the dataframe we can already see some columns that will need to be dropped or cleaned, and some that will need to be assigned new data types, for example _'US viewers (millions)'_ and _'Original air date'_.<br>
The last two columns also have to be merged into one: the table for the last season (15) uses a different column name than the tables for the earlier seasons.
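A rough sketch of what that cleaning could look like is shown below on a toy dataframe. The second viewer column name (`'U.S. viewers (millions)'`), the footnote markers, the output column names and all values are assumptions made up for illustration. The real scraped cells may contain extra text and need further handling.

```python
import pandas as pd

# toy frame mimicking the problem: two differently named viewer columns
# (column names and values are illustrative assumptions, not scraped data)
toy = pd.DataFrame({
    "Original air date": ["September 22, 2005", "January 8, 2020"],
    "US viewers (millions)": ["10.50[1]", None],
    "U.S. viewers (millions)": [None, "5.25[2]"],
})

# take the first column where it has a value, otherwise fall back to the second
viewers = toy["US viewers (millions)"].combine_first(toy["U.S. viewers (millions)"])

# drop footnote markers such as "[1]" and convert the remaining text to floats
viewers = viewers.str.replace(r"\[.*?\]", "", regex=True)
toy["us_viewers_millions"] = pd.to_numeric(viewers, errors="coerce")

# parse the air date strings into datetime values
toy["air_date"] = pd.to_datetime(toy["Original air date"], format="%B %d, %Y")

# keep only the cleaned columns
toy = toy.drop(columns=["US viewers (millions)", "U.S. viewers (millions)", "Original air date"])
print(toy.dtypes)
```

`errors="coerce"` turns anything that still is not a number into `NaN` instead of raising, which makes leftover problem cells easy to find afterwards.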

### 3.0: Scraping Data from IMDB

In [None]:
# lists that will make up the columns of our dataframe
season_number_lst = []
episode_number_lst = []
episode_name_lst = []
episode_description_lst = []
episode_airdate_lst = []
imdb_rating_lst = []
imdb_votes_lst = []

In [None]:
# retrieving the html document for each season's page from imdb
# criminal minds has 15 seasons
for season in range(15):
    season_number = season + 1
    # print(f'Extracting Data for Season {season_number}')
    # each season has its own page, hence the 'season=' query parameter with our variable
    imdb_url = f'https://www.imdb.com/title/tt0452046/episodes?season={season_number}'
    imdb_response = get(imdb_url)

    # imdb_response.status_code of 200 means the request succeeded

    # name the parser explicitly (lxml was installed above)
    season_html = BeautifulSoup(imdb_response.content, 'lxml')
    season_info = season_html.find_all('div', attrs={'class': 'info'})

    # retrieving only the relevant data from each season's html document
    for episode_number, episode in enumerate(season_info):
        episode_name = episode.strong.a.text
        # print(f'episode name: {episode_name}')

        # take the text of the tag, not the tag object itself
        episode_description = episode.find(attrs={'class': 'item_description'}).text.strip()
        # print(f'summary: {episode_description}')

        episode_airdate = episode.find(attrs={'class': 'airdate'}).text.strip()
        # print(f'episode aired on: {episode_airdate}')

        imdb_rating = episode.find(attrs={'class': 'ipl-rating-star__rating'}).text
        # print(f'imdb rating: {imdb_rating}')

        imdb_votes = episode.find(attrs={'class': 'ipl-rating-star__total-votes'}).text.strip('()')
        # print(f'votes on imdb: {imdb_votes}')

        season_number_lst.append(season_number)
        episode_number_lst.append(episode_number + 1)
        episode_name_lst.append(episode_name)
        episode_description_lst.append(episode_description)
        episode_airdate_lst.append(episode_airdate)
        imdb_rating_lst.append(imdb_rating)
        imdb_votes_lst.append(imdb_votes)
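The loop above assumes every episode block contains every field. If a tag is missing (for instance, an episode with no rating yet), `find()` returns `None` and the `.text` lookup raises an `AttributeError`. Below is a minimal sketch of a more defensive extraction. It runs on hand-written HTML that only imitates the class names used above, and `text_or_none` is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# hand-written HTML imitating the structure the loop expects (made-up content)
sample_html = """
<div class="info">
  <strong><a>Episode One</a></strong>
  <div class="airdate"> 1 Jan. 2000 </div>
  <span class="ipl-rating-star__rating">8.1</span>
  <span class="ipl-rating-star__total-votes">(1,234)</span>
  <div class="item_description"> A made-up summary. </div>
</div>
<div class="info">
  <strong><a>Episode Two</a></strong>
  <div class="airdate"> 8 Jan. 2000 </div>
  <div class="item_description"> No rating yet. </div>
</div>
"""

def text_or_none(parent, css_class):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(attrs={"class": css_class})
    return tag.get_text(strip=True) if tag is not None else None

soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for episode in soup.find_all("div", attrs={"class": "info"}):
    votes = text_or_none(episode, "ipl-rating-star__total-votes")
    rows.append({
        "episode_name": episode.strong.a.get_text(strip=True),
        "imdb_rating": text_or_none(episode, "ipl-rating-star__rating"),
        # remove the surrounding parentheses only when a vote count exists
        "imdb_votes": votes.strip("()") if votes else None,
    })
print(rows)
```

Missing values arrive as `None`, which pandas treats as missing data, so the later `pd.to_numeric` conversions should still work on such a column.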

In [None]:
cm_imdb_df = pd.DataFrame({
        'season_number':season_number_lst,
        'episode_number':episode_number_lst,
        'episode_name':episode_name_lst,
        'episode_description':episode_description_lst,
        'imdb_rating':imdb_rating_lst,
        'imdb_votes':imdb_votes_lst
})

In [None]:
cm_imdb_df

#### 3.1: Cleaning the IMDB Data

In [None]:
# checking out data types
cm_imdb_df.info()

In [None]:
# remove the "," in 'imdb_votes' 
cm_imdb_df.imdb_votes = cm_imdb_df['imdb_votes'].str.replace(",","")

# converting the column data type to allow math operations
cm_imdb_df.imdb_votes = pd.to_numeric(cm_imdb_df.imdb_votes)

# converting the ratings column as well
cm_imdb_df.imdb_rating = pd.to_numeric(cm_imdb_df.imdb_rating)

In [None]:
# a DataFrame exposes the column data types through .dtypes (plural)
cm_imdb_df.dtypes
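The introduction describes merging the IMDB and Wikipedia data into one dataset. Below is a minimal sketch of how that merge could work, on two toy dataframes with made-up values. It assumes both frames end up with matching `season_number` and `episode_number` columns, and `us_viewers_millions` is a hypothetical column name. The Wikipedia dataframe built earlier would first need cleaning to get into this shape.

```python
import pandas as pd

# toy stand-ins for the two cleaned dataframes (values are made up)
imdb_toy = pd.DataFrame({
    "season_number": [1, 1, 2],
    "episode_number": [1, 2, 1],
    "imdb_rating": [8.0, 7.5, 8.2],
})
wiki_toy = pd.DataFrame({
    "season_number": [1, 1, 2],
    "episode_number": [1, 2, 1],
    "us_viewers_millions": [10.5, 9.75, 12.0],
})

# a left join keeps every IMDB episode even if Wikipedia has no matching row;
# validate raises an error if a season/episode pair is duplicated on either side
merged = imdb_toy.merge(
    wiki_toy,
    on=["season_number", "episode_number"],
    how="left",
    validate="one_to_one",
)
print(merged)
```

Joining on the season and episode numbers rather than on the episode title avoids mismatches caused by small differences in how the two sites spell or punctuate titles.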