In [13]:
import pandas as pd

In [20]:
characters = pd.read_csv("../datasets/characters.csv")

We will start by taking an overview of the data:

In [21]:
characters.describe()

Unnamed: 0,name,birth,nationality,gender,height,weight,hair_color,eye_color,build,race,...,last_appeared_chapter,affiliation,title,rank,status,death,occupation,ajah,clan,sept
count,2285,225,2146,2285,334,30,651,373,579,40,...,1572,1235,408,857,2267,386,985,360,244,200
unique,2285,142,58,2,48,15,277,120,106,1,...,60,103,118,95,17,70,123,8,17,36
top,Kairen Stang,Age of Legends,Unknown nationality,Female,Tall,Slender,Dark,Dark,Plump,Ogier,...,Prologue,White Tower,Lord,Aes Sedai,Alive,1000 NE,Soldier,Green Ajah,Unknown clan,Unknown sept
freq,1,13,479,1186,174,7,70,70,66,40,...,125,317,102,362,1498,209,241,62,83,140


In [5]:
characters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2289 entries, 0 to 2288
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   name                     2285 non-null   object
 1   birth                    225 non-null    object
 2   nationality              2146 non-null   object
 3   gender                   2285 non-null   object
 4   height                   334 non-null    object
 5   weight                   30 non-null     object
 6   hair_color               651 non-null    object
 7   eye_color                373 non-null    object
 8   build                    579 non-null    object
 9   race                     40 non-null     object
 10  first_mentioned_book     883 non-null    object
 11  first_mentioned_chapter  790 non-null    object
 12  last_mentioned_book      908 non-null    object
 13  last_mentioned_chapter   859 non-null    object
 14  first_appeared_book      1629 non-null  

In [4]:
print(f'There are {len(characters.index)} characters.')

There are 2289 characters.


As of today (03/05/2020), the **Wheel of Time wiki** lists the following character counts: **1189** [Female](https://wot.fandom.com/wiki/Category:Women), **1101** [Male](https://wot.fandom.com/wiki/Category:Men) and no [Unknown Gender](https://wot.fandom.com/wiki/Category:Unknown_gender) characters. This would sum to a total of **2290** characters, but the wiki counts the [Wolfbrothers](https://wot.fandom.com/wiki/Category:Wolfbrothers) category page itself as a Male character, so the total of **2289** is the correct one, at least as of this date.
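The arithmetic above can be double-checked in a few lines. This is only a sanity-check sketch: the variable names are mine, and the numbers are the category counts quoted above.

```python
# Category counts quoted from the wiki on 03/05/2020.
female_count = 1189
male_count = 1101
unknown_gender_count = 0

# The Wolfbrothers category page is counted under Men,
# so one entry in that category is not an actual character.
miscounted_category_pages = 1

expected_total = (
    female_count + male_count + unknown_gender_count - miscounted_category_pages
)
print(expected_total)  # 2289
```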

## Dropping Characters that Don't Have Bio Details on the Wiki

First we are going to look for characters that don't have Bio details on the wiki. Looking through the scraped characters in order, I could find two of them on the wiki:

* [Culan Cuhan](https://wot.fandom.com/wiki/Culan_Cuhan)
* [Anla the Wise Counselor](https://wot.fandom.com/wiki/Anla_the_Wise_Counselor)

However, since Scrapy makes many requests in parallel, I couldn't spot the others.

In [22]:
characters[characters["name"].isna()]

Unnamed: 0,name,birth,nationality,gender,height,weight,hair_color,eye_color,build,race,...,last_appeared_chapter,affiliation,title,rank,status,death,occupation,ajah,clan,sept
279,,,,,,,,,,,...,,,,,,,,,,
314,,,,,,,,,,,...,,,,,,,,,,
1252,,,,,,,,,,,...,,,,,,,,,,
2204,,,,,,,,,,,...,,,,,,,,,,


As can be seen, there are four characters that don't have a Bio on the wiki. Let's drop these rows.

In [23]:
indexes_drop = characters[characters["name"].isna()].index
characters.drop(indexes_drop, inplace=True)
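An equivalent way to drop rows with a missing name is `dropna` with the `subset` argument. The sketch below uses a small made-up DataFrame (hypothetical names, not the scraped data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped data: two rows came back without any Bio fields.
toy = pd.DataFrame({
    "name": ["Character A", np.nan, "Character B", np.nan],
    "gender": ["Male", np.nan, "Female", np.nan],
})

# Drop every row whose "name" is missing; other columns may still contain NaN.
toy_cleaned = toy.dropna(subset=["name"])

print(len(toy_cleaned))  # 2
```

This returns a new DataFrame instead of modifying the original in place, which makes the step easier to re-run in a notebook.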

In [24]:
print(f'We are left with {len(characters.index)} characters.')

We are left with 2285 characters.


## Dealing with Nationality Data

Some characters don't have the **nationality** field filled in their Bios. I am going to fill them with the value **Unknown nationality**, as this is already used in the wiki when it is explicitly stated that a character's nationality is unknown.

In [25]:
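# Assign the result back: calling fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions.
characters["nationality"] = characters["nationality"].fillna("Unknown nationality")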

Let's look at the count for each nationality:

In [26]:
characters["nationality"].value_counts()

Unknown nationality    618
Andoran                385
Aiel                   256
Seanchan               119
Cairhienin             116
Tairen                  88
Domani                  86
Altaran                 71
Saldaean                60
Kandori                 55
Murandian               50
Taraboner               45
Shienaran               37
Amadician               36
Illianer                36
Ghealdanin              34
Arafellin               33
Far Madding             21
Borderlander            15
Mayener                 14
Malkieri                10
Malkier                  9
Saldaea                  7
Manetherenite            6
Andor                    6
Tar Valon                5
Shandalle                5
Kandor                   5
Sharan                   5
Cairhien                 4
Arad Doman               3
Illian                   3
Murandy                  3
Aldesharin               3
Black Hills              3
Ghealdan                 2
Aridholin                2
S

Looking at the data we can see that there are some values that have different names but that actually correspond to the same nationality, such as **Andor** and **Andoran**. To avoid this duplication we will map them to get a uniform name for each nationality.

In [27]:
nationality_mapping = {
    'Unknown nationality':'Unknown nationality',
    'Andoran':'Andoran',
    'Aiel':'Aiel',
    'Seanchan':'Seanchan',
    'Cairhienin':'Cairhienin',
    'Tairen':'Tairen',
    'Domani':'Domani',
    'Altaran':'Altaran',
    'Saldaean':'Saldaean',
    'Kandori':'Kandori', 
    'Murandian':'Murandian',
    'Taraboner':'Taraboner',
    'Shienaran':'Shienaran',
    'Amadician':'Amadician',
    'Illianer':'Illianer',
    'Ghealdanin':'Ghealdanin',
    'Arafellin':'Arafellin',
    'Far Madding':'Far Madding',
    'Borderlander':'Borderlander',
    'Mayener':'Mayener',
    'Malkieri':'Malkieri',
    'Malkier':'Malkieri',
    'Saldaea':'Saldaean',
    'Andor': 'Andoran',
    'Manetherenite':'Manetherenite',
    'Tar Valon':'Tar Valon',
    'Shandalle':'Shandalle',
    'Kandor':'Kandori',
    'Sharan':'Sharan',
    'Cairhien':'Cairhienin',
    'Illian':'Illianer',
    'Murandy':'Murandian',
    'Aldesharin':'Aldesharin',
    'Arad Doman':'Domani',
    'Black Hills':'Black Hills',
    'Tarabon':'Taraboner',
    'Sea Folk':'Sea Folk',
    'Shienar':'Shienaran',
    'Arafel':'Arafellin',
    'Aramaellen':'Aramaellen',
    'Aridholin':'Aridholin',
    'Shiotan':'Shiotan',
    'Altara':'Altaran',
    'Mayene':'Mayener',
    'Ghealdan':'Ghealdanin',
    'Darmovanin':'Darmovanin',
    'Dal Calainin':'Dal Calainin',
    'Dorlan':'Dorlan',                  
    'Amadicia':'Amadician',
    'Essam':'Essam',
    'Jaramide':'Jaramide',
    'Aramaelle':'Aramaelle',
    'Amayarin':'Amayarin',
    'Talmouri':'Talmouri', 
    'Masenasharin':'Masenasharin',
    'Essenian':'Essenian',
    'Hol Cuchone':'Hol Cuchone',
    'Saferin':'Saferin'
}

characters['nationality'] = characters['nationality'].map(nationality_mapping)
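One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`, which is why the mapping above lists every nationality, including the ones that stay unchanged. A shorter alternative, sketched here on a toy Series (not the real dataset), is `Series.replace`, which only touches the listed values and leaves the rest alone.

```python
import pandas as pd

# Toy sample of raw nationality values (not the real dataset).
raw = pd.Series(["Andor", "Andoran", "Malkier", "Aiel", "Saldaea"])

# Only the country-name variants need an entry here.
variant_mapping = {
    "Andor": "Andoran",
    "Malkier": "Malkieri",
    "Saldaea": "Saldaean",
}

# replace() leaves values that are not in the dictionary unchanged.
normalized = raw.replace(variant_mapping)
print(normalized.tolist())
# ['Andoran', 'Andoran', 'Malkieri', 'Aiel', 'Saldaean']

# map() turns values that are missing from the dictionary into NaN.
mapped = raw.map(variant_mapping)
print(mapped.isna().sum())  # 2
```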

In [28]:
characters["nationality"].value_counts()

Unknown nationality    618
Andoran                391
Aiel                   256
Cairhienin             120
Seanchan               119
Domani                  89
Tairen                  88
Altaran                 73
Saldaean                67
Kandori                 60
Murandian               53
Taraboner               47
Illianer                39
Shienaran               39
Amadician               37
Ghealdanin              36
Arafellin               35
Far Madding             21
Malkieri                19
Mayener                 16
Borderlander            15
Manetherenite            6
Sharan                   5
Shandalle                5
Tar Valon                5
Black Hills              3
Aldesharin               3
Aramaellen               2
Shiotan                  2
Sea Folk                 2
Aridholin                2
Jaramide                 1
Amayarin                 1
Aramaelle                1
Essam                    1
Hol Cuchone              1
Masenasharin             1
E

In [29]:
print(f'Now we end up with {characters["nationality"].nunique()} unique nationalities')

Now we end up with 43 unique nationalities


Something needs to be done about these characters, but we are going to take a deeper look at the books data in the next section and fix this problem there. Bear with me.


## Analyzing Book Mentions and Appearances

First, we are going to look at the book occurrences and the number of unique books for each of the four book attributes:

### First Mentioned

In [33]:
characters['first_mentioned_book'].value_counts()

LOC                            118
TSR                             97
TEOTW                           77
TOM                             69
TFOH                            60
NS                              53
ACOS                            51
TWORJTWOT                       48
KOD                             41
TGS                             38
COT                             38
TGH                             38
TDR                             36
WH                              35
TPOD                            24
RPG                             21
AMOL                            17
The Wheel of Time Companion     14
Wheel of Time Companion          6
TSASG                            2
Name: first_mentioned_book, dtype: int64

In [37]:
characters['first_mentioned_book'].nunique()

20

Things are looking good. There are 20 entries: the 14 books in the series, New Spring, and 5 extras that actually correspond to the following 4 books:

* **TWORJTWOT** (The World of Robert Jordan's The Wheel of Time)
* **The Wheel of Time Companion** & **Wheel of Time Companion** (both correspond to the same book)
* **RPG** (The Wheel of Time Roleplaying Game)
* **TSASG** (The Strike at Shayol Ghul)
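If the two Companion spellings ever need to be merged, something along these lines would do it. This is a sketch on a toy Series (not the real column), and picking the longer spelling as the canonical label is my assumption.

```python
import pandas as pd

# Toy sample of book labels (not the real dataset).
books = pd.Series([
    "The Wheel of Time Companion",
    "Wheel of Time Companion",
    "TWORJTWOT",
    "RPG",
])

# Fold the shorter spelling into the longer one (whole-value match, not substring).
books = books.replace({"Wheel of Time Companion": "The Wheel of Time Companion"})

print(books.nunique())  # 3
```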


### Last Mentioned

In [40]:
characters['last_mentioned_book'].value_counts()

AMOL         115
TOM          109
KOD           84
LOC           80
TGS           79
TSR           71
COT           59
ACOS          49
TFOH          41
WH            41
TWORJTWOT     40
NS            37
TPOD          37
TGH           22
TDR           18
TEOTW         17
RPG            7
TSASG          2
Name: last_mentioned_book, dtype: int64

In [41]:
characters['last_mentioned_book'].nunique()

18

For last mentioned there are 18 different books: the 14 books in the series, New Spring, and the following 3 books:

* **TWORJTWOT** (The World of Robert Jordan's The Wheel of Time)
* **RPG** (The Wheel of Time Roleplaying Game)
* **TSASG** (The Strike at Shayol Ghul)


### First Appeared

In [35]:
characters['first_appeared_book'].value_counts()

LOC          200
KOD          145
TSR          133
ACOS         131
TPOD         126
TOM          122
TFOH         117
NS           104
TEOTW        103
TGS           89
WH            85
TGH           79
COT           76
AMOL          61
TDR           53
CCG            3
TWORJTWOT      2
Name: first_appeared_book, dtype: int64

In [36]:
characters['first_appeared_book'].nunique()

17

Again, no problem. There are 17 entries: the 14 books in the series, New Spring, and 2 extras that actually correspond to 2 books:

* **TWORJTWOT** (The World of Robert Jordan's The Wheel of Time)
* **CCG** (The Wheel of Time Collectible Card Game)


### Last Appeared

In [42]:
characters['last_appeared_book'].value_counts()

AMOL         255
KOD          222
TOM          209
TGS          137
LOC          131
TSR           99
TPOD          91
COT           76
WH            67
NS            64
TFOH          53
ACOS          51
TDR           42
TGH           38
TEOTW         37
CCG            3
TWORJTWOT      2
Name: last_appeared_book, dtype: int64

In [43]:
characters['last_appeared_book'].nunique()

17

Yet again 17 entries: the same 14 books in the series, New Spring, and 2 extras that actually correspond to 2 books:

* **TWORJTWOT** (The World of Robert Jordan's The Wheel of Time)
* **CCG** (The Wheel of Time Collectible Card Game)

### Cleaning Characters Only Mentioned in Peripheral Material

I am going to drop the characters for which none of the four book attributes refers to one of the 15 books in the series (including New Spring). That way we are left only with characters whose data indicates that they appear or are mentioned in the series.

In [44]:
series_books = ['NS', 'TEOTW', 'TGH', 'TDR', 'TSR','TFOH', 'LOC', 'ACOS','TPOD','WH', 'COT', 'KOD', 'TGS','TOM', 'AMOL']

In [45]:
condition = (
    characters['first_mentioned_book'].isin(series_books)
    | characters['last_mentioned_book'].isin(series_books)
    | characters['first_appeared_book'].isin(series_books)
    | characters['last_appeared_book'].isin(series_books)
)
characters = characters[condition]
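The same filter can also be written by applying `isin` to all four columns at once and keeping the rows where any of them matches. The sketch below runs on a small made-up DataFrame (hypothetical characters A, B and C); `book_columns` and `in_series` are names I introduce for illustration.

```python
import pandas as pd

series_books = ['NS', 'TEOTW', 'TGH', 'TDR', 'TSR', 'TFOH', 'LOC', 'ACOS',
                'TPOD', 'WH', 'COT', 'KOD', 'TGS', 'TOM', 'AMOL']
book_columns = ['first_mentioned_book', 'last_mentioned_book',
                'first_appeared_book', 'last_appeared_book']

# Toy rows (hypothetical characters, not the real dataset).
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'first_mentioned_book': ['TEOTW', 'RPG', None],
    'last_mentioned_book': ['AMOL', 'RPG', None],
    'first_appeared_book': [None, None, 'CCG'],
    'last_appeared_book': [None, None, 'CCG'],
})

# True where at least one of the four columns names a book in the series.
in_series = toy[book_columns].isin(series_books).any(axis=1)
kept = toy[in_series]

print(kept['name'].tolist())  # ['A']
```

Character B only shows up in the RPG and character C only in the CCG, so both are filtered out.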

In [46]:
print(f'We remain with {len(characters.index)} characters')

We remain with 2179 characters


In [47]:
characters.to_csv('../datasets/characters_cleaned.csv')
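A small note on saving: by default `to_csv` also writes the DataFrame index, and reading the file back then produces an extra `Unnamed: 0` column. Whether that matters depends on what comes next, so take this as an optional sketch. It uses a toy DataFrame and an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Toy DataFrame with a gap in the index, as happens after dropping rows.
toy = pd.DataFrame({"name": ["A", "B", "C"]}).drop(index=1)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
print(pd.read_csv(with_index).columns.tolist())  # ['Unnamed: 0', 'name']

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
print(pd.read_csv(without_index).columns.tolist())  # ['name']
```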

After the cleaning we are left with **2179** characters. For the analysis I intend to do in the future, this cleaning is sufficient. If, for some reason, I need some other column to be cleaned, I will get to it in a later post.

That's it for now, folks! I hope you enjoyed this tour through the Wheel of Time data. Feel free to contact me for any reason.

Cheers!