# Web Scraping Project -- The Expanse
By Eli Taylor

I will use the Selenium package to scrape the names of characters in The Expanse book series from The Expanse Fandom Wiki. After cleaning the data, I'll use the names to build a character map for the series with natural language processing.

### Import Packages

In [76]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import re

### Create Driver with Selenium

In [77]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [78]:
page_url = 'https://expanse.fandom.com/wiki/Category:Characters_(Books)'
driver.get(page_url)

### Get Character Names

In [79]:
#this tells the driver where to look for the character names
character_categories = driver.find_elements(By.CLASS_NAME, 'category-page__member-link')
character_categories

[<selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="0b345e57-d830-4aed-9177-a7efbfe68c7c")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="f02e5a1b-3ccb-45ce-adce-f96383d68cde")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="5963ec44-c5df-4c78-b01b-a7a1f4cd6843")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="708def5b-7a38-4ee7-b226-f357d1fca327")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="6be89131-5e82-4512-b70f-b2da7e010330")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="53f1583a-633c-4021-948e-7d97a12779e1")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="82db1822-9ecb-4e98-9f82-49

In [80]:
character_categories[1].text

'Aaman'

In [81]:
print(type(character_categories))

<class 'list'>


In [82]:
#this reads the name and url from each link element and stores them in a list
characters = []
for character in character_categories:
    character_url = character.get_attribute('href')
    character_name = character.text
    characters.append({'character_name': character_name, 'url': character_url})

### Getting names from page 2

In [83]:
page_2_url = 'https://expanse.fandom.com/wiki/Category:Characters_(Books)?from=Levy%2C+Ekko%0AEkko+Levy'
driver.get(page_2_url)
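
The page 2 URL above is hard-coded, and its `from` query parameter is percent-encoded. Here is a small sketch, using only the standard library, that decodes the parameter and rebuilds the same URL from its parts. The names `base_url`, `sort_key`, and `rebuilt_url` are mine, and it is an assumption that the wiki keeps paginating this way.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

page_2_url = ('https://expanse.fandom.com/wiki/Category:Characters_(Books)'
              '?from=Levy%2C+Ekko%0AEkko+Levy')

# split the URL into the part before '?' and the query string
parts = urlsplit(page_2_url)
base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'

# decode the 'from' parameter to see the plain-text value
sort_key = parse_qs(parts.query)['from'][0]

# re-encode the value to confirm it round-trips to the same URL
rebuilt_url = base_url + '?' + urlencode({'from': sort_key})
```

The decoded value is `'Levy, Ekko\nEkko Levy'`, which looks like a sort key followed by the page title of the first entry on page 2.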

In [84]:
#this tells the driver where to look for the character names
character_categories_2 = driver.find_elements(By.CLASS_NAME, 'category-page__member-link')
character_categories_2

[<selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="5f79dfc3-b7af-4e90-96ed-3f92459ef9a7")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="4a49b688-592d-4542-8da5-abc37f4deeae")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="2c8f7772-2a83-4228-aa32-54b86f61c833")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="1eb01ebf-7c7f-45f7-93db-88d93523f5e4")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="4fc2bcec-8458-4566-b889-636ccfab8111")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="805a9d2f-617f-4f7d-a452-a45282988731")>,
 <selenium.webdriver.remote.webelement.WebElement (session="22bbf140a1aa302b45d9cafc4d155396", element="65acf673-54b8-4312-904e-c0

In [85]:
#this reads the name and url from each page 2 link element and adds them to the same list
for character in character_categories_2:
    character_url = character.get_attribute('href')
    character_name = character.text
    characters.append({'character_name': character_name, 'url': character_url})
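
The loops for page 1 and page 2 repeat the same logic, so they could share one helper. This is a minimal sketch, not part of the original walkthrough: `collect_characters` is a hypothetical name, and `FakeLink` is a stand-in for a Selenium `WebElement` so the example runs without a browser.

```python
def collect_characters(elements):
    """Return one {'character_name', 'url'} dict per link element."""
    return [
        {'character_name': el.text, 'url': el.get_attribute('href')}
        for el in elements
    ]


class FakeLink:
    """Stand-in for a Selenium WebElement (only .text and href are modelled)."""

    def __init__(self, text, href):
        self.text = text
        self._href = href

    def get_attribute(self, name):
        return self._href if name == 'href' else None


sample = [FakeLink('Aaman', 'https://expanse.fandom.com/wiki/Aaman')]
records = collect_characters(sample)
```

In the notebook this would be called once per page, right after `find_elements` and before the next `driver.get`, because elements from a page the driver has left go stale.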

### Create Pandas DataFrame

In [86]:
characters_df = pd.DataFrame(characters)
characters_df

| | character_name | url |
|---|---|---|
| 0 | Characters renamed for TV | https://expanse.fandom.com/wiki/Characters_ren... |
| 1 | Aaman | https://expanse.fandom.com/wiki/Aaman |
| 2 | Aaron (Books) | https://expanse.fandom.com/wiki/Aaron_(Books) |
| 3 | Abril (Books) | https://expanse.fandom.com/wiki/Abril_(Books) |
| 4 | Ade Tukunbo (Books) | https://expanse.fandom.com/wiki/Ade_Tukunbo_(B... |
| 5 | Adiki-Sandoval | https://expanse.fandom.com/wiki/Adiki-Sandoval |
| 6 | Adiyah (Books) | https://expanse.fandom.com/wiki/Adiyah_(Books) |
| 7 | Admiral Milan | https://expanse.fandom.com/wiki/Admiral_Milan |
| 8 | Agnete | https://expanse.fandom.com/wiki/Agnete |
| 9 | Carl al-Dujaili (Books) | https://expanse.fandom.com/wiki/Carl_al-Dujail... |


### Cleaning up Character Names with Regex
Some of the character names include parentheses with extra information. For example, (Books) means that the character only appears in the books (and not in the TV series).

The driver also picked up several rows that are not character names, like "Characters that appear only in books", so I've dropped those.

In [87]:
#removing the text within parentheses, plus the space left in front of it
characters_df.character_name = characters_df.character_name.apply(lambda x: re.sub(r"\s*\(.*?\)", "", x).strip())
characters_df['first_name'] = characters_df.character_name.apply(lambda x: x.split(' ', 1)[0])
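
To check the cleanup without running the scraper, here is a small standalone sketch of the same two steps on a few names from the table above. The helper names `clean_name` and `first_name` are mine. The pattern also removes the whitespace before the parenthesis, which is one way to avoid a trailing space in the cleaned name.

```python
import re

# match any "(...)" group plus the whitespace in front of it
PAREN_NOTE = re.compile(r'\s*\(.*?\)')


def clean_name(raw_name):
    """Strip parenthesised notes such as '(Books)' from a name."""
    return PAREN_NOTE.sub('', raw_name).strip()


def first_name(name):
    """Return everything before the first space."""
    return name.split(' ', 1)[0]


raw_names = ['Aaron (Books)', 'Ade Tukunbo (Books)', 'Adiki-Sandoval']
cleaned = [clean_name(n) for n in raw_names]
firsts = [first_name(n) for n in cleaned]
```

Here `cleaned` is `['Aaron', 'Ade Tukunbo', 'Adiki-Sandoval']` and `firsts` is `['Aaron', 'Ade', 'Adiki-Sandoval']`.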

In [88]:
pd.set_option('display.max_rows', None)
characters_df

| | character_name | url | first_name |
|---|---|---|---|
| 0 | Characters renamed for TV | https://expanse.fandom.com/wiki/Characters_ren... | Characters |
| 1 | Aaman | https://expanse.fandom.com/wiki/Aaman | Aaman |
| 2 | Aaron | https://expanse.fandom.com/wiki/Aaron_(Books) | Aaron |
| 3 | Abril | https://expanse.fandom.com/wiki/Abril_(Books) | Abril |
| 4 | Ade Tukunbo | https://expanse.fandom.com/wiki/Ade_Tukunbo_(B... | Ade |
| 5 | Adiki-Sandoval | https://expanse.fandom.com/wiki/Adiki-Sandoval | Adiki-Sandoval |
| 6 | Adiyah | https://expanse.fandom.com/wiki/Adiyah_(Books) | Adiyah |
| 7 | Admiral Milan | https://expanse.fandom.com/wiki/Admiral_Milan | Admiral |
| 8 | Agnete | https://expanse.fandom.com/wiki/Agnete | Agnete |
| 9 | Carl al-Dujaili | https://expanse.fandom.com/wiki/Carl_al-Dujail... | Carl |


In [89]:
#dropping rows that are not names of characters
characters_df = characters_df.drop(index=[0, 41, 46, 62, 63, 89, 120, 203, 208, 238, 239, 269, 277, 278, 326, 337, 338, 354, 359]).reset_index()

In [90]:
characters_df.drop(columns=['index'], inplace=True)
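
The hard-coded index list only matches the page order at the time of scraping, so it would stop lining up if the wiki adds or removes a page. One alternative, sketched here under assumptions, is to filter the scraped records by title before building the DataFrame. `NON_CHARACTER_TITLES` and `keep_characters` are hypothetical names, and the set holds only the two titles mentioned in this notebook, not the full list of rows I dropped.

```python
# titles mentioned in this notebook that are category links, not characters;
# this set is incomplete and would need extending for the full wiki listing
NON_CHARACTER_TITLES = {
    'Characters renamed for TV',
    'Characters that appear only in books',
}


def keep_characters(records, excluded=NON_CHARACTER_TITLES):
    """Return only the records whose name is not a known non-character title."""
    return [r for r in records if r['character_name'] not in excluded]


sample_records = [
    # url left as None because it is truncated in the output above
    {'character_name': 'Characters renamed for TV', 'url': None},
    {'character_name': 'Aaman', 'url': 'https://expanse.fandom.com/wiki/Aaman'},
    {'character_name': 'Aaron', 'url': 'https://expanse.fandom.com/wiki/Aaron_(Books)'},
]
kept = keep_characters(sample_records)
```

With the real data, this could run as `pd.DataFrame(keep_characters(characters))`, which would also make the later `reset_index` step unnecessary.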

In [92]:
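#manually correcting rows where the first word is not the name to match on (for example, 'Admiral Milan' should be 'Milan')
characters_df.at[6,'first_name']='Milan'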
characters_df.at[22,'first_name']='Father Anton'
characters_df.at[22,'character_name']='Father Anton'
characters_df.at[50,'first_name']='Burnham'
characters_df.at[113,'first_name']='Father Michel'
characters_df.at[140,'first_name']='Jim'
characters_df.at[227,'first_name']='Miller'
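
The corrections above depend on row positions, which shift whenever the dropped rows change. A possible alternative is to key the corrections by character name. This is only a sketch: `FIRST_NAME_OVERRIDES` and `first_name_for` are hypothetical names, and the table lists just the two entries whose character names can be read from the cells above.

```python
# hypothetical override table keyed by character name; only these two
# name/first-name pairs are visible in the notebook cells above
FIRST_NAME_OVERRIDES = {
    'Admiral Milan': 'Milan',
    'Father Anton': 'Father Anton',
}


def first_name_for(character_name, overrides=FIRST_NAME_OVERRIDES):
    """Use the override if one exists, else the text before the first space."""
    if character_name in overrides:
        return overrides[character_name]
    return character_name.split(' ', 1)[0]


names = ['Admiral Milan', 'Father Anton', 'Ade Tukunbo']
first_names = [first_name_for(n) for n in names]
```

In the notebook this could replace the split and the manual fixes with `characters_df['first_name'] = characters_df.character_name.apply(first_name_for)`.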

In [93]:
characters_df

| | character_name | url | first_name |
|---|---|---|---|
| 0 | Aaman | https://expanse.fandom.com/wiki/Aaman | Aaman |
| 1 | Aaron | https://expanse.fandom.com/wiki/Aaron_(Books) | Aaron |
| 2 | Abril | https://expanse.fandom.com/wiki/Abril_(Books) | Abril |
| 3 | Ade Tukunbo | https://expanse.fandom.com/wiki/Ade_Tukunbo_(B... | Ade |
| 4 | Adiki-Sandoval | https://expanse.fandom.com/wiki/Adiki-Sandoval | Adiki-Sandoval |
| 5 | Adiyah | https://expanse.fandom.com/wiki/Adiyah_(Books) | Adiyah |
| 6 | Admiral Milan | https://expanse.fandom.com/wiki/Admiral_Milan | Milan |
| 7 | Agnete | https://expanse.fandom.com/wiki/Agnete | Agnete |
| 8 | Carl al-Dujaili | https://expanse.fandom.com/wiki/Carl_al-Dujail... | Carl |
| 9 | Al-Farmi | https://expanse.fandom.com/wiki/Al-Farmi_(Books) | Al-Farmi |


### Save the DataFrame to a CSV

In [94]:
characters_df.to_csv('expanse_characters.csv', index=False)

### Project Inspiration & Code Source

Thanks to Thu Vu data analytics for the inspiration and walkthrough for this project (YouTube video: https://www.youtube.com/watch?v=RuNolAh_4bU)! The code has been adapted to scrape data on The Expanse characters.

Check out The Expanse Fandom Wiki here: https://expanse.fandom.com/wiki/The_Expanse_Wiki