# Webscraping Project -- The Expanse
By Eli Taylor

I use the Selenium package to scrape character names for The Expanse book series from The Expanse Fandom Wiki. After cleaning the data, I'll use the names to build a character map for the series with natural language processing.

### Import Packages

In [None]:
import re
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

### Create Driver with Selenium

In [None]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [None]:
page_url = 'https://expanse.fandom.com/wiki/Category:Characters_(Books)'
driver.get(page_url)

### Get Character Names

In [44]:
#find every link element on the category page that points to a character
character_categories = driver.find_elements(By.CLASS_NAME, 'category-page__member-link')
character_categories

In [None]:
character_categories[1].text

In [None]:
#collect the name and page url from each link element
characters = []
for character in character_categories:
    character_url = character.get_attribute('href')
    character_name = character.text
    characters.append({'character_name': character_name, 'url': character_url})
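The loop above can also be wrapped in a small function, which makes it possible to check the logic without opening a browser. This is a minimal sketch, not part of the original walkthrough: `links_to_records` and `FakeLink` are hypothetical names, the example.org URLs are placeholders, and the only assumption about the elements is that they expose `.text` and `.get_attribute('href')` the way Selenium's link elements do.

```python
def links_to_records(elements):
    """Turn link elements into {'character_name', 'url'} records."""
    records = []
    seen_urls = set()
    for element in elements:
        name = element.text.strip()
        url = element.get_attribute('href')
        # Skip links with no visible text or no target, and repeated URLs.
        if not name or not url or url in seen_urls:
            continue
        seen_urls.add(url)
        records.append({'character_name': name, 'url': url})
    return records


class FakeLink:
    """Hypothetical stand-in for a Selenium element, used for offline checks."""

    def __init__(self, text, href):
        self.text = text
        self._href = href

    def get_attribute(self, name):
        return self._href if name == 'href' else None


# Placeholder data: one valid link, one empty link, one repeat of the first.
sample_links = [
    FakeLink('Example Character (Books)', 'https://example.org/wiki/Example_Character'),
    FakeLink('', 'https://example.org/wiki/Empty'),
    FakeLink('Example Character (Books)', 'https://example.org/wiki/Example_Character'),
]
sample_records = links_to_records(sample_links)
```

With a live driver, the same function could be called as `links_to_records(character_categories)`, and the result passed straight to `pd.DataFrame`.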

### Create Pandas DataFrame

In [None]:
characters_df = pd.DataFrame(characters)
characters_df

In [None]:
characters_df.character_name.value_counts()

### Cleaning up Character Names with Regex
Some of the character names include parentheses with extra information. For example, (Books) means that the character only appears in the books (and not in the TV series).

The driver also picked up a couple of rows which are not character names, like "Characters that appear only in books", so I've dropped those.
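Both cleanup steps can be tried on a few strings before touching the DataFrame. The sketch below uses only the standard library; the sample names are illustrative rather than copied from the scraped data, and `clean_name`, `is_character`, and `NON_CHARACTER_LABELS` are hypothetical helpers. It filters non-character rows by their text, which is one alternative to dropping rows by index.

```python
import re

# Optional whitespace followed by one parenthetical, e.g. " (Books)".
PARENTHETICAL = re.compile(r"\s*\(.*?\)")

# Labels that show up on the category page but are not characters.
# Only the first entry comes from the text above; extend as needed.
NON_CHARACTER_LABELS = {"Characters that appear only in books"}


def clean_name(raw_name):
    """Remove parenthetical notes and surrounding whitespace."""
    return PARENTHETICAL.sub("", raw_name).strip()


def is_character(raw_name):
    """Return False for known category labels that are not character names."""
    return raw_name.strip() not in NON_CHARACTER_LABELS


sample_names = [
    "James Holden (Books)",
    "Characters that appear only in books",
    "Naomi Nagata",
]
cleaned = [clean_name(name) for name in sample_names if is_character(name)]
first_names = [name.split(" ", 1)[0] for name in cleaned]
print(cleaned)      # ['James Holden', 'Naomi Nagata']
print(first_names)  # ['James', 'Naomi']
```

The same `clean_name` function could be passed to `characters_df.character_name.apply(...)`. Filtering by label has the advantage of not breaking if the wiki adds or removes entries and the row positions shift.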

In [None]:
#removing the text within parentheses (raw string avoids invalid escape warnings) and trimming leftover whitespace
characters_df.character_name = characters_df.character_name.apply(lambda x: re.sub(r"\(.*?\)", "", x).strip())
characters_df['first_name'] = characters_df.character_name.apply(lambda x: x.split(' ', 1)[0])

In [None]:
pd.set_option('display.max_rows', None)
characters_df

In [None]:
#dropping rows that are not names of characters (index labels taken from the output above)
#drop=True discards the old index instead of adding it as a column
characters_df = characters_df.drop(index=[0, 62, 63, 89]).reset_index(drop=True)

In [None]:
characters_df

### Save the DataFrame to a CSV

In [None]:
characters_df.to_csv('expanse_characters.csv', index=False)

### Project Inspiration & Code Source

Thanks to Thu Vu data analytics for the inspiration and walkthrough for this project! The code has been adapted to scrape data on The Expanse characters.
https://www.youtube.com/watch?v=RuNolAh_4bU