In [1]:
import pandas as pd

In [30]:
characters = pd.read_csv("../datasets/characters.csv")

We will start by taking an overview of the data:

In [31]:
characters.describe()

Unnamed: 0,name,birth,nationality,gender,height,weight,hair_color,eye_color,build,race,...,rank,status,death,occupation,ajah,clan,sept,animal_type,color,owner
count,2378,226,2147,2372,339,30,652,373,580,40,...,858,2360,387,985,360,244,200,92,50,60
unique,2378,142,58,7,49,15,277,120,106,1,...,95,17,70,123,8,17,36,3,20,42
top,Seaine Herimon,Age of Legends,Unknown nationality,Female,Tall,Slender,Dark,Dark,Plump,Ogier,...,Aes Sedai,Alive,1000 NE,Soldier,Green Ajah,Unknown clan,Unknown sept,Horse,Bay,Rand al'Thor
freq,1,13,479,1193,176,7,70,70,66,40,...,362,1578,210,241,62,83,140,60,11,4


In [32]:
characters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2382 entries, 0 to 2381
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   name                     2378 non-null   object
 1   birth                    226 non-null    object
 2   nationality              2147 non-null   object
 3   gender                   2372 non-null   object
 4   height                   339 non-null    object
 5   weight                   30 non-null     object
 6   hair_color               652 non-null    object
 7   eye_color                373 non-null    object
 8   build                    580 non-null    object
 9   race                     40 non-null     object
 10  first_mentioned_book     886 non-null    object
 11  first_mentioned_chapter  793 non-null    object
 12  last_mentioned_book      913 non-null    object
 13  last_mentioned_chapter   863 non-null    object
 14  first_appeared_book      1718 non-null  

In [33]:
print(f'There are {len(characters.index)} characters.')

There are 2382 characters.


As of today (03/05/2020), the **Wheel of Time wiki** lists **1189** [Female](https://wot.fandom.com/wiki/Category:Women) and **1101** [Male](https://wot.fandom.com/wiki/Category:Men) characters, and no [Unknown Gender](https://wot.fandom.com/wiki/Category:Unknown_gender) characters. This would sum to a total of **2290** characters, but the wiki counts the [Wolfbrothers](https://wot.fandom.com/wiki/Category:Wolfbrothers) category page as a Male character, so the correct total is **2289**.

**UPDATE**: Two characters don't appear in the list of Men:

* [Narg](https://wot.fandom.com/wiki/Narg)
* [Jain](https://wot.fandom.com/wiki/Jain_Farstriders)

So I crawled them separately.
I also crawled [Horses](https://wot.fandom.com/wiki/Category:Horses) (which contains **62** pages, but two of them, [Dhurrans](https://wot.fandom.com/wiki/Wolf) and [Razor](https://wot.fandom.com/wiki/Razor), are breeds of horses in WoT and not characters) and [Wolves](https://wot.fandom.com/wiki/Category:Wolves) (which contains **32** pages, but one of them, the [Wolf](https://wot.fandom.com/wiki/Wolf) page, isn't a character).

So the total of **2289 + 2 + 31 + 60 = 2382** is correct as of today.
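
The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: the variable names are made up for the illustration, and the numbers are the ones quoted in the text.

```python
# Counts quoted above from the wiki category pages.
female, male = 1189, 1101
wolfbrothers_category = 1   # category page counted among the men, not a character
missing_from_men = 2        # Narg and Jain, crawled separately
horses = 62 - 2             # minus the two breed pages
wolves = 32 - 1             # minus the generic Wolf page

people = female + male - wolfbrothers_category
expected_total = people + missing_from_men + wolves + horses
print(people, expected_total)
```

The second number should match the row count reported by `len(characters.index)`.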

## Dropping Characters that Don't Have Bio Details on the Wiki

First we are going to look for characters that don't have Bio details in the wiki. Looking at the order of scraped characters I could find two of those characters in the wiki. 

* [Culan Cuhan](https://wot.fandom.com/wiki/Culan_Cuhan)
* [Anla the Wise Counselor](https://wot.fandom.com/wiki/Anla_the_Wise_Counselor)

However, as Scrapy makes a lot of parallel requests, I couldn't spot the others this way.

In [34]:
characters[characters["name"].isna()]

Unnamed: 0,name,birth,nationality,gender,height,weight,hair_color,eye_color,build,race,...,rank,status,death,occupation,ajah,clan,sept,animal_type,color,owner
279,,,,,,,,,,,...,,,,,,,,,,
314,,,,,,,,,,,...,,,,,,,,,,
1252,,,,,,,,,,,...,,,,,,,,,,
2204,,,,,,,,,,,...,,,,,,,,,,


As can be seen, there are four characters that don't have a Bio in the wiki. Let's drop these rows.

In [35]:
indexes_drop = characters[characters["name"].isna()].index
characters.drop(indexes_drop, inplace=True)

In [36]:
print(f'We are left with {len(characters.index)} characters.')

We are left with 2378 characters.
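
As a side note, the same cleanup can probably be written with `dropna` and a `subset`. The sketch below runs on a tiny stand-in frame (the `toy` data is invented for the example), not on the real scraped table.

```python
import pandas as pd

# Toy stand-in for the scraped table; the real frame has 30 columns.
toy = pd.DataFrame({
    "name": ["Narg", None, "Bela", None],
    "gender": ["Male", None, "Mare", None],
})

# Keep only the rows that have a name.
cleaned = toy.dropna(subset=["name"])
print(len(toy) - len(cleaned), "rows dropped")
```

Both approaches remove exactly the rows whose `name` is missing; `dropna` just avoids building the index of rows to drop by hand.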


## Dealing with Gender Data

Some characters (only animals) don't have the **gender** filled in their Bios. I am going to fill them with the value **Unknown**, as the wiki already uses it when a character's gender is explicitly stated to be unknown.

In [37]:
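# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
characters["gender"] = characters["gender"].fillna("Unknown")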

In [38]:
characters["gender"].value_counts()

Female      1193
Male        1119
Unknown       22
Mare          20
Gelding       16
Stallion       7
Stalion        1
Name: gender, dtype: int64

We can see that the different terms for Male (Stalion, Stallion and Gelding) and Female (Mare) horses are creating some noise. Let's make them uniform:

In [39]:
gender_mapping = {
    'Male':'Male',
    'Stalion':'Male',
    'Stallion':'Male',
    'Gelding':'Male',
    'Female':'Female',
    'Mare':'Female',
}

characters['gender'] = characters['gender'].map(gender_mapping)

In [40]:
characters["gender"].value_counts()

Female    1213
Male      1143
Name: gender, dtype: int64
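
One thing to keep in mind: `Series.map` with a dictionary returns NaN for every value that is not a key, and `value_counts` skips NaN by default. That is consistent with the 22 **Unknown** entries no longer showing up in the count above. If the goal is to keep them, one possible approach (a sketch on toy data, with a reduced `horse_terms` dictionary invented for the example) is `Series.replace`, which leaves unlisted values untouched:

```python
import pandas as pd

gender = pd.Series(["Female", "Mare", "Stalion", "Unknown", "Gelding"])

# Only the horse-specific terms need an entry (hypothetical reduced mapping).
horse_terms = {"Mare": "Female", "Stallion": "Male", "Stalion": "Male", "Gelding": "Male"}

mapped = gender.map(horse_terms)        # values missing from the dict become NaN
replaced = gender.replace(horse_terms)  # values missing from the dict are kept

print(mapped.isna().sum())
print(replaced.tolist())
```

Adding an `'Unknown': 'Unknown'` entry to the full mapping would have the same effect while keeping `map`.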

## Dealing with Nationality Data

Some characters don't have the **nationality** filled in their Bios. I am going to fill them with the value **Unknown nationality**, as the wiki already uses it when a character's nationality is explicitly stated to be unknown.

In [41]:
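# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
characters["nationality"] = characters["nationality"].fillna("Unknown nationality")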

Let's look at the count for each nationality:

In [42]:
characters["nationality"].value_counts()

Unknown nationality    710
Andoran                385
Aiel                   256
Seanchan               119
Cairhienin             116
Tairen                  88
Domani                  86
Altaran                 71
Saldaean                60
Kandori                 55
Murandian               50
Taraboner               45
Shienaran               37
Amadician               36
Illianer                36
Ghealdanin              34
Arafellin               33
Far Madding             21
Borderlander            15
Mayener                 14
Malkieri                11
Malkier                  9
Saldaea                  7
Andor                    6
Manetherenite            6
Tar Valon                5
Sharan                   5
Shandalle                5
Kandor                   5
Cairhien                 4
Murandy                  3
Aldesharin               3
Illian                   3
Black Hills              3
Arad Doman               3
Arafel                   2
Aramaellen               2
G

Looking at the data we can see that there are some values that have different names but that actually correspond to the same nationality, such as **Andor** and **Andoran**. To avoid this duplication we will map them to get a uniform name for each nationality.

In [43]:
nationality_mapping = {
    'Unknown nationality':'Unknown',
    'Andoran':'Andoran',
    'Aiel':'Aiel',
    'Seanchan':'Seanchan',
    'Cairhienin':'Cairhienin',
    'Tairen':'Tairen',
    'Domani':'Domani',
    'Altaran':'Altaran',
    'Saldaean':'Saldaean',
    'Kandori':'Kandori', 
    'Murandian':'Murandian',
    'Taraboner':'Taraboner',
    'Shienaran':'Shienaran',
    'Amadician':'Amadician',
    'Illianer':'Illianer',
    'Ghealdanin':'Ghealdanin',
    'Arafellin':'Arafellin',
    'Far Madding':'Far Madding',
    'Borderlander':'Borderlander',
    'Mayener':'Mayener',
    'Malkieri':'Malkieri',
    'Malkier':'Malkieri',
    'Saldaea':'Saldaean',
    'Andor': 'Andoran',
    'Manetherenite':'Manetherenite',
    'Tar Valon':'Tar Valon',
    'Shandalle':'Shandalle',
    'Kandor':'Kandori',
    'Sharan':'Sharan',
    'Cairhien':'Cairhienin',
    'Illian':'Illianer',
    'Murandy':'Murandian',
    'Aldesharin':'Aldesharin',
    'Arad Doman':'Domani',
    'Black Hills':'Black Hills',
    'Tarabon':'Taraboner',
    'Sea Folk':'Sea Folk',
    'Shienar':'Shienaran',
    'Arafel':'Arafellin',
    'Aramaellen':'Aramaellen',
    'Aridholin':'Aridholin',
    'Shiotan':'Shiotan',
    'Altara':'Altaran',
    'Mayene':'Mayener',
    'Ghealdan':'Ghealdanin',
    'Darmovanin':'Darmovanin',
    'Dal Calainin':'Dal Calainin',
    'Dorlan':'Dorlan',                  
    'Amadicia':'Amadician',
    'Essam':'Essam',
    'Jaramide':'Jaramide',
    'Aramaelle':'Aramaelle',
    'Amayarin':'Amayarin',
    'Talmouri':'Talmouri', 
    'Masenasharin':'Masenasharin',
    'Essenian':'Essenian',
    'Hol Cuchone':'Hol Cuchone',
    'Saferin':'Saferin'
}

characters['nationality'] = characters['nationality'].map(nationality_mapping)
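
Because `map` turns any value missing from the dictionary into NaN, it can be worth checking for uncovered values before applying a mapping this long. A minimal sketch on toy data (the short `mapping` and the `unmapped` name are assumptions made for the example):

```python
import pandas as pd

nationality = pd.Series(["Andor", "Andoran", "Aiel", "Seanchan", "Andor"])
mapping = {"Andor": "Andoran", "Andoran": "Andoran", "Aiel": "Aiel"}  # toy subset

# Values present in the data but absent from the mapping would become NaN.
unmapped = sorted(set(nationality.dropna()) - set(mapping))
print(unmapped)
```

An empty list means every value in the column has an entry in the dictionary.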

In [44]:
characters["nationality"].value_counts()

Unknown          710
Andoran          391
Aiel             256
Cairhienin       120
Seanchan         119
Domani            89
Tairen            88
Altaran           73
Saldaean          67
Kandori           60
Murandian         53
Taraboner         47
Shienaran         39
Illianer          39
Amadician         37
Ghealdanin        36
Arafellin         35
Far Madding       21
Malkieri          20
Mayener           16
Borderlander      15
Manetherenite      6
Tar Valon          5
Sharan             5
Shandalle          5
Black Hills        3
Aldesharin         3
Shiotan            2
Aridholin          2
Sea Folk           2
Aramaellen         2
Essenian           1
Talmouri           1
Saferin            1
Darmovanin         1
Amayarin           1
Jaramide           1
Dorlan             1
Essam              1
Hol Cuchone        1
Masenasharin       1
Aramaelle          1
Dal Calainin       1
Name: nationality, dtype: int64

In [45]:
print(f'Now we end up with {characters["nationality"].nunique()} unique nationalities')

Now we end up with 43 unique nationalities


Something needs to be done about these characters, but we will first take a deeper look at the books data in the next section and fix this problem there, so bear with me.


## Analyzing Book Mentions and Appearances

First we are going to look at the book occurrences and the number of unique books for each of the four book attributes:

### First Mentioned

In [46]:
characters['first_mentioned_book'].value_counts()

LOC                            118
TSR                             97
TEOTW                           78
TOM                             69
TFOH                            60
NS                              53
ACOS                            51
TWORJTWOT                       48
KOD                             41
TGS                             38
COT                             38
TGH                             38
TDR                             37
WH                              36
TPOD                            24
RPG                             21
AMOL                            17
The Wheel of Time Companion     14
Wheel of Time Companion          6
TSASG                            2
Name: first_mentioned_book, dtype: int64

In [47]:
characters['first_mentioned_book'].nunique()

20

Things are looking good. There are 20 entries: the 14 books in the series, New Spring and 5 extras that actually correspond to the following 4 books:

* **TWORJTWOT** (The World of Robert Jordan's The Wheel of Time)
* **The Wheel of Time Companion** & **Wheel of Time Companion** (both correspond to the same book)
* **RPG** (The Wheel of Time Roleplaying Game)
* **TSASG** (The Strike at Shayol Ghul)
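
If the two spellings of the Companion ever need to be merged into one label, a small `replace` should be enough. This is a sketch on toy data and is not applied in this notebook; keeping the longer spelling as the canonical one is an arbitrary choice.

```python
import pandas as pd

books = pd.Series(["LOC", "The Wheel of Time Companion", "Wheel of Time Companion", "RPG"])

# Fold the shorter spelling into the longer one; other labels are left as they are.
normalized = books.replace({"Wheel of Time Companion": "The Wheel of Time Companion"})
print(normalized.nunique())
```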


### Last Mentioned

In [48]:
characters['last_mentioned_book'].value_counts()

AMOL         117
TOM          112
KOD           84
LOC           80
TGS           79
TSR           71
COT           59
ACOS          49
TFOH          41
WH            41
TWORJTWOT     40
NS            37
TPOD          37
TGH           22
TDR           18
TEOTW         17
RPG            7
TSASG          2
Name: last_mentioned_book, dtype: int64

In [49]:
characters['last_mentioned_book'].nunique()

18

For last mentioned there are 18 different books: the 14 books in the series, New Spring and the following 3 books:

* **TWORJTWOT** (The World of Robert Jordan's The Wheel of Time)
* **RPG** (The Wheel of Time Roleplaying Game)
* **TSASG** (The Strike at Shayol Ghul)


### First Appeared

In [50]:
characters['first_appeared_book'].value_counts()

LOC          210
KOD          156
TSR          136
ACOS         134
TPOD         133
TOM          133
TFOH         121
TEOTW        112
NS           107
WH            91
TGS           90
COT           85
TGH           82
AMOL          65
TDR           58
CCG            3
TWORJTWOT      2
Name: first_appeared_book, dtype: int64

In [51]:
characters['first_appeared_book'].nunique()

17

Again, no problem. There are 17 entries: the 14 books in the series, New Spring and 2 extras that correspond to 2 books:

* **TWORJTWOT** (The World of Robert Jordan's The Wheel of Time)
* **CCG** (The Wheel of Time Collectible Card Game)


### Last Appeared

In [52]:
characters['last_appeared_book'].value_counts()

AMOL         261
KOD          223
TOM          215
TGS          139
LOC          132
TSR           99
TPOD          92
COT           77
WH            67
NS            66
TFOH          54
ACOS          51
TDR           43
TGH           38
TEOTW         38
CCG            3
TWORJTWOT      2
Name: last_appeared_book, dtype: int64

In [53]:
characters['last_appeared_book'].nunique()

17

Yet again 17 entries: the same 14 books in the series, New Spring and 2 extras that correspond to 2 books:

* **TWORJTWOT** (The World of Robert Jordan's The Wheel of Time)
* **CCG** (The Wheel of Time Collectible Card Game)

### Cleaning Characters Only Mentioned in Peripheral Material

I am going to remove the characters for which none of the four book attributes refers to one of the 15 books in the series (including New Spring), so that we keep only characters whose data indicates that they appear or are mentioned in the series.

In [54]:
series_books = ['NS', 'TEOTW', 'TGH', 'TDR', 'TSR','TFOH', 'LOC', 'ACOS','TPOD','WH', 'COT', 'KOD', 'TGS','TOM', 'AMOL']

In [55]:
condition = (
    characters['first_mentioned_book'].isin(series_books)
    | characters['last_mentioned_book'].isin(series_books)
    | characters['first_appeared_book'].isin(series_books)
    | characters['last_appeared_book'].isin(series_books)
)
characters = characters[condition]

In [56]:
print(f'We remain with {len(characters.index)} characters')

We remain with 2272 characters
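
The filter used here can also be expressed by selecting the four book columns at once and asking whether any of them holds a series book. The sketch below uses a shortened book list and an invented three-row frame, purely to show the shape of the check.

```python
import pandas as pd

series_books = ["NS", "TEOTW", "TGH"]  # shortened list for the toy example
book_columns = ["first_mentioned_book", "last_mentioned_book",
                "first_appeared_book", "last_appeared_book"]

toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "first_mentioned_book": ["RPG", "NS", None],
    "last_mentioned_book": ["RPG", "TGH", None],
    "first_appeared_book": [None, None, "CCG"],
    "last_appeared_book": [None, "TGH", "TEOTW"],
})

# True where at least one of the four columns names a book from the series.
in_series = toy[book_columns].isin(series_books).any(axis=1)
print(toy.loc[in_series, "name"].tolist())
```

Character `A` is dropped because it is only tied to the RPG, while `B` and `C` each have at least one attribute pointing to a book in the series.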


In [57]:
characters.to_csv('../datasets/characters_cleaned.csv', index=False)

After the cleaning we remain with **2272** characters. For the analysis I intend to do in the future, this cleaning is sufficient. If, for some reason, I need some other column to be cleaned, I will get to it in a later post.

That's it for now, folks! I hope you enjoyed this tour through the Wheel of Time data. Feel free to contact me for any reason.

Cheers!