# CSMODEL MACHINE PROJECT PHASE 1

## Phase 1: Dataset Description, Data Cleaning, and Research Question & Data Analysis

### 1. Dataset Description
- Provide a clear description of the dataset.
- Explain how the dataset was collected and discuss any potential biases or implications.
- Describe the structure of the dataset:
  - What each row and column represents
  - Number of observations
  - Attributes/features present
- Briefly explain each attribute (for structured data) or the nature of the observations (for unstructured data).

### 2. Data Cleaning
- Document the cleaning steps taken, such as:
  - Multiple representations of the same categorical value
  - Incorrect datatypes
  - Missing data
  - Duplicate data
  - Inconsistent formatting
  - Outliers
- For unstructured data, mention any transformation applied to convert it into a usable format.

### 3. Research Question & EDA
- Define a general research question relevant to your dataset.
- Conduct Exploratory Data Analysis (EDA) iteratively with the research question.
- Include at least 3 EDA questions and answer each using:
  - Summary statistics (e.g., mean, median, standard deviation)
  - Visualizations (e.g., histograms, box plots, distribution plots)

## Dataset Description

### About the Dataset
This dataset is a detailed record compiled by *The Washington Post* documenting every on-screen death in the **Game of Thrones** TV series across all eight seasons. It includes characters, background extras, and animals. When exact counts weren’t possible, educated estimations based on visual evidence were used.

### Data Collection and Methodology

**Criteria:**
- The character is killed on-screen.
- The character dies off-screen, but the death is confirmed or assumed due to imminent death while on screen.
- Only prominent off-screen deaths are listed. (Prominence is determined mainly by importance to the plot.)

**Key Methodological Notes:**
- Dragonfire and wildfire deaths were estimated by visible area of effect and troop density.
- Deaths are attributed to the direct killer, not the one who ordered the kill—unless the killer is unknown.
- Special treatment is given to undead (wights), resurrections, and ambiguous causes of death.

**Notes:**
- If a character orders the death of another, the character who does the direct killing receives credit, not the one who orders the kill. When the direct killer is unidentifiable, the order-giver receives credit instead.
- In cases of overlapping weapon types (e.g. magic fireball vs. fire vs. magic), the weapon category is assigned based on the origin. For example, dragonfire is considered an “animal” death and magic fireball is considered a “magic” death.
- If a character is mercy-killed, the mercy kill is used to categorize the death, not the injuries leading up to the moment.
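
The attribution and categorization notes above can be restated as a small sketch. The function names `credit_killer` and `categorize_death`, their arguments, and the example values are hypothetical; they are not part of the dataset or of *The Washington Post*'s tooling and only illustrate the rules as described.

```python
from typing import Optional


def credit_killer(direct_killer: Optional[str], order_giver: Optional[str]) -> Optional[str]:
    """Credit the direct killer; fall back to the order-giver only when the
    direct killer cannot be identified."""
    if direct_killer:
        return direct_killer
    return order_giver


def categorize_death(prior_injury_cat: str, mercy_kill_cat: Optional[str]) -> str:
    """A mercy kill, when present, decides the category instead of the
    injuries that preceded it."""
    return mercy_kill_cat if mercy_kill_cat else prior_injury_cat


# Hypothetical examples (placeholder names, not rows from the dataset)
print(credit_killer("Executioner", "King"))   # direct killer is known
print(credit_killer(None, "King"))            # direct killer unidentifiable
print(categorize_death("Animal", "Blade"))    # mercy kill overrides prior injuries
```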
  
### Dataset Structure

- **Total Observations:** 6,887 deaths
- **Total Attributes:** 11

The following are the descriptions of each variable in the dataset.
- **`order`**: A unique index for the chronological order of deaths.
- **`season`**: Season number (1 to 8) in which the death occurred.
- **`episode`**: Episode number within the respective season.
- **`character_killed`**: Name of the deceased character or creature.
- **`killer`**: Name of the killer (individual, creature, or group).
- **`method`**: Specific method of death (e.g., “Dragonfire”, “Sword (Ice)”).
- **`method_cat`**: General category of death method (e.g., “Animal”, “Blade”, “Magic”).
- **`reason`**: Explanation or motivation for the death (e.g., “Deserting the Night’s Watch”).
- **`location`**: Specific place where the death occurred (e.g., “Winterfell”).
- **`allegiance`**: Group, house, or faction the deceased belonged to.
- **`importance`**: Level of narrative importance (1 = Background extra, 4 = Major character).
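
The attribute descriptions above imply a few simple constraints (a unique `order`, seasons 1 to 8, importance 1 to 4). A minimal sketch of how such constraints could be checked is shown below; it uses a made-up DataFrame with invented rows, and the `checks` dictionary is only an illustrative convention.

```python
import pandas as pd

# Made-up rows using the documented column names
toy_df = pd.DataFrame({
    "order": [1, 2, 3],
    "season": [1, 1, 8],
    "importance": [2.0, 1.0, 4.0],
})

# Ranges taken from the attribute descriptions: seasons 1-8, importance 1-4
checks = {
    "order_is_unique": bool(toy_df["order"].is_unique),
    "season_in_range": bool(toy_df["season"].between(1, 8).all()),
    "importance_in_range": bool(toy_df["importance"].between(1, 4).all()),
}
print(checks)
```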

## Data Cleaning
This section explains all the procedures applied during the data cleaning process.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# sets the theme of the charts
plt.style.use('ggplot')

%matplotlib inline

In [4]:
GOTdeaths_df = pd.read_csv('game-of-thrones-deaths-data.csv')

In [5]:
GOTdeaths_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6887 entries, 0 to 6886
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   order             6887 non-null   int64  
 1   season            6887 non-null   int64  
 2   episode           6887 non-null   int64  
 3   character_killed  6887 non-null   object 
 4   killer            6410 non-null   object 
 5   method            6887 non-null   object 
 6   method_cat        6887 non-null   object 
 7   reason            6886 non-null   object 
 8   location          6887 non-null   object 
 9   allegiance        3136 non-null   object 
 10  importance        6886 non-null   float64
dtypes: float64(1), int64(3), object(7)
memory usage: 592.0+ KB


In [6]:
GOTdeaths_df.head()

Unnamed: 0,order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
0,1,1,1,Waymar Royce,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,"House Royce, Night’s Watch",2.0
1,2,1,1,Gared,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,Night’s Watch,2.0
2,3,1,1,Will,Ned Stark,Sword (Ice),Blade,Deserting the Night’s Watch,Winterfell,Night’s Watch,2.0
3,4,1,1,Stag,Direwolf,Direwolf teeth,Animal,Unknown,Winterfell,,1.0
4,5,1,1,Direwolf,Stag,Antler,Animal,Unknown,Winterfell,,1.0


## Multiple representations of the same value

In [None]:
# Count rows per method to spot multiple spellings of the same value
print(GOTdeaths_df.groupby('method').size())
print(GOTdeaths_df['method_cat'].unique())

method
Antler                        1
Arakh                        54
Arrow                       138
Axe                         102
Barrel                        7
                           ... 
Unknown (likely a sword)      1
Valyrian steel dagger         2
Warhammer                     4
Water (drowning)              1
Wildfire                    210
Length: 98, dtype: int64
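
One possible way to look for multiple representations of the same value is to compare the raw labels with a normalized version (trimmed and lower-cased). The sketch below runs on a small made-up Series rather than the real `method` column, and `find_variant_labels` is a hypothetical helper name.

```python
import pandas as pd


def find_variant_labels(values: pd.Series) -> dict:
    """Group raw labels by a normalized key (stripped, lower-cased) and
    return only the keys that map to more than one raw spelling."""
    raw = values.dropna().astype(str)
    keys = raw.str.strip().str.lower()
    grouped = raw.groupby(keys).unique()
    return {key: sorted(labels) for key, labels in grouped.items() if len(labels) > 1}


# Made-up labels for illustration only
sample = pd.Series(["Sword", "sword ", "Arrow", "Axe", "axe", "Wildfire"])
variants = find_variant_labels(sample)
print(variants)
```

An empty dictionary would suggest that differences in case or surrounding whitespace are not a problem for that column; it would not catch synonyms or misspellings.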


## Handling missing data

In [8]:
GOTdeaths_df.isnull().sum()
GOTdeaths_df[GOTdeaths_df['killer'].isnull()]


Unnamed: 0,order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
60,61,2,1,The Silver,,Starvation,Other,No food or water in the Red Waste,Red Waste,"Dothraki, House Targaryen",2.0
191,192,3,2,Hoster Tully,,Illness,Other,,Riverrun,House Tully,2.0
196,197,3,4,Bannen,,Broken foot,Other,Critically wounded at the Battle of the Fist o...,Beyond the Wall,Night’s Watch,1.0
211,212,3,6,Wildling,,Falling,Falling,Fell while climbing the Wall,The Wall,Free Folk,1.0
212,213,3,6,Wildling,,Falling,Falling,Fell while climbing the Wall,The Wall,Free Folk,1.0
...,...,...,...,...,...,...,...,...,...,...,...
4490,4491,8,3,Wight,,Dragonglass barricade,Other,Killed during the Battle of Winterfell,Winterfell,,1.0
4491,4492,8,3,Wight,,Dragonglass barricade,Other,Killed during the Battle of Winterfell,Winterfell,,1.0
4492,4493,8,3,Wight,,Dragonglass barricade,Other,Killed during the Battle of Winterfell,Winterfell,,1.0
4493,4494,8,3,Wight,,Dragonglass barricade,Other,Killed during the Battle of Winterfell,Winterfell,,1.0


In [9]:
print(GOTdeaths_df.dtypes)

order                 int64
season                int64
episode               int64
character_killed     object
killer               object
method               object
method_cat           object
reason               object
location             object
allegiance           object
importance          float64
dtype: object


The criteria state that only prominent off-screen deaths are listed (prominence is determined mainly by importance to the plot).
Accordingly, wights with the lowest importance (1.0) and no recorded killer are removed from the dataset.

In [10]:
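# Drop background wights (importance 1.0) with no recorded killer
wight_mask = ((GOTdeaths_df['character_killed'].str.lower() == 'wight')
              & GOTdeaths_df['killer'].isnull()
              & (GOTdeaths_df['importance'] == 1.0))
# .copy() avoids SettingWithCopyWarning when columns are assigned later
GOTdeaths_df = GOTdeaths_df[~wight_mask].copy()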


In [11]:
GOTdeaths_df[GOTdeaths_df['killer'].isnull()]

Unnamed: 0,order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
60,61,2,1,The Silver,,Starvation,Other,No food or water in the Red Waste,Red Waste,"Dothraki, House Targaryen",2.0
191,192,3,2,Hoster Tully,,Illness,Other,,Riverrun,House Tully,2.0
196,197,3,4,Bannen,,Broken foot,Other,Critically wounded at the Battle of the Fist o...,Beyond the Wall,Night’s Watch,1.0
211,212,3,6,Wildling,,Falling,Falling,Fell while climbing the Wall,The Wall,Free Folk,1.0
212,213,3,6,Wildling,,Falling,Falling,Fell while climbing the Wall,The Wall,Free Folk,1.0
213,214,3,6,Wildling,,Falling,Falling,Fell while climbing the Wall,The Wall,Free Folk,1.0
214,215,3,6,Wildling,,Falling,Falling,Fell while climbing the Wall,The Wall,Free Folk,1.0
215,216,3,6,Wildling,,Falling,Falling,Fell while climbing the Wall,The Wall,Free Folk,1.0
216,217,3,6,Wildling,,Falling,Falling,Fell while climbing the Wall,The Wall,Free Folk,1.0
398,399,4,9,Night’s Watch brother,,Burning oil,Fire/Burning,Killed during the Battle of Castle Black,The Wall,Night’s Watch,1.0


All entries with a null `killer` are now labeled 'No direct killer', since these deaths have neither a direct killer nor an order-giver.

In [12]:

GOTdeaths_df['killer'] = GOTdeaths_df['killer'].fillna('No direct killer')

The null `reason` is replaced with 'Unknown', as the information is not available.

In [13]:
GOTdeaths_df[GOTdeaths_df['reason'].isnull()]
GOTdeaths_df.at[191, 'reason'] = 'Unknown'

All character deaths with null allegiance will be listed as 'Unknown'

In [14]:
GOTdeaths_df.isnull().sum()
GOTdeaths_df[GOTdeaths_df['allegiance'].isnull()]
GOTdeaths_df['allegiance'] = GOTdeaths_df['allegiance'].fillna('Unknown')

In [15]:
GOTdeaths_df[GOTdeaths_df['importance'].isnull()]

Unnamed: 0,order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
4477,4478,8,3,Wight,Brienne of Tarth,Sword (Oathkeeper),Blade,Killed during the Battle of Winterfell,Winterfell,Unknown,


All entries with a null `importance` are removed from the dataset, as they are not relevant to the analysis.

In [16]:
GOTdeaths_df = GOTdeaths_df[GOTdeaths_df['importance'].notnull()]
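
As a sanity check after steps like the ones above, one could confirm that no missing values remain. The sketch below reproduces the same fill-and-drop pattern on a tiny made-up DataFrame; the rows are invented, only the column names and placeholder labels come from this notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same kinds of gaps as the real data
toy_df = pd.DataFrame({
    "killer": ["Ned Stark", np.nan, "Direwolf"],
    "reason": ["Unknown", np.nan, "Unknown"],
    "allegiance": [np.nan, "House Tully", np.nan],
    "importance": [2.0, 2.0, np.nan],
})

# Fill text columns with explicit placeholders, then drop rows lacking importance
toy_df = toy_df.fillna({
    "killer": "No direct killer",
    "reason": "Unknown",
    "allegiance": "Unknown",
})
toy_df = toy_df.dropna(subset=["importance"])

# Every count should be zero once cleaning is complete
remaining_nulls = toy_df.isnull().sum()
print(remaining_nulls)
```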

## Fixing incorrect data types and values

In [17]:
GOTdeaths_df.dtypes

order                 int64
season                int64
episode               int64
character_killed     object
killer               object
method               object
method_cat           object
reason               object
location             object
allegiance           object
importance          float64
dtype: object

There are no conflicting data types in the DataFrame.
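
Although the types are consistent, `importance` is stored as `float64`, most likely because the column originally contained a missing value. If an integer scale is preferred, a cast along the following lines would be possible once the nulls are gone. This is an optional sketch on made-up values, not a step taken in this notebook.

```python
import pandas as pd

# Made-up importance scores stored as floats, as in the notebook
importance = pd.Series([1.0, 2.0, 4.0, 3.0], name="importance")

# Only cast when every value is a whole number, so nothing is silently truncated
if (importance % 1 == 0).all():
    importance = importance.astype("int64")

print(importance.dtype)
```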

## Resolving duplicates and outliers

In [18]:
# Find duplicate 'order' values
duplicates = GOTdeaths_df[GOTdeaths_df.duplicated('order', keep=False)]
print(duplicates)

Empty DataFrame
Columns: [order, season, episode, character_killed, killer, method, method_cat, reason, location, allegiance, importance]
Index: []


There are no duplicate `order` values, so no death is recorded twice. Repeated names in `character_killed` (e.g., Wildlings and Wights) are not duplicates: they are distinct background characters that share a generic name.
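
The distinction between a duplicated key and a repeated generic name can be illustrated with a short sketch. The rows below are made up; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows: two background deaths share a name but have different `order` values
toy_df = pd.DataFrame({
    "order": [1, 2, 3],
    "character_killed": ["Wildling", "Wildling", "Hoster Tully"],
    "season": [3, 3, 3],
})

# Duplicate check on the unique key: no rows should be flagged
dup_by_order = toy_df[toy_df.duplicated("order", keep=False)]

# Repeated names are expected for background extras and are not errors
name_counts = toy_df["character_killed"].value_counts()
repeated_names = name_counts[name_counts > 1]

print(len(dup_by_order))
print(repeated_names)
```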

In [19]:
print(GOTdeaths_df['character_killed'].unique())

['Waymar Royce' 'Gared' 'Will' 'Stag' 'Direwolf' 'Jon Arryn'
 'Dothraki man' 'Catspaw assassin' 'Mycah' 'Lady' 'Ser Hugh of the Vale'
 'Clegane’s horse' 'Stark soldier' 'Tribesman' 'Lannister soldier'
 'Jory Cassel' 'Wallen' 'Wildling' 'Stiv' 'Vardis Egen'
 'Viserys Targaryen' 'Robert Baratheon' 'Vayon Poole' 'Stark staff member'
 'Septa Mordane' 'Syrio Forel' 'Stableboy' 'Othor' 'Jafer Flowers' 'Mago'
 'Raven' 'Drogo’s horse' 'Qotho' 'Pigeon' 'Ned Stark' 'Rhaego'
 'Khal Drogo' 'Mirri Maz Duur' 'Knight' 'The Silver' 'Maester Cressen'
 'Barra (Baratheon illegitimate daughter)'
 'Robert Baratheon’s illegitimate son' 'Rakharo' 'Yoren'
 'Night’s Watch recruit' 'Lommy Greenhands' 'Rennick' 'Prisoner'
 'Renly Baratheon' 'Baratheon of Storm’s End guard' 'The Tickler'
 'Rodrik Cassel' 'High Septon' 'Peasant' 'Amory Lorch' 'Drennan' 'Irri'
 'Alton Lannister' 'Torrhen Karstark' 'The Spice King' 'The Silk King'
 'The Copper King' 'Member of the Thirteen' 'Billy' 'Jack'
 'Matthos Seaworth' 'Barath

In [22]:
importance_counts = GOTdeaths_df.groupby('importance').size()
print(importance_counts)

importance
1.0    6217
2.0      85
3.0      75
4.0      44
dtype: int64
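
To make the imbalance between importance levels easier to read, the counts can be converted to proportions. The sketch below rebuilds the counts from the output above as a standalone Series (the name `deaths` is only an illustrative label) instead of reusing `GOTdeaths_df`.

```python
import pandas as pd

# Counts copied from the output above
importance_counts = pd.Series(
    {1.0: 6217, 2.0: 85, 3.0: 75, 4.0: 44}, name="deaths"
)

# Share of all remaining deaths at each importance level
importance_share = importance_counts / importance_counts.sum()
print(importance_share.round(4))
```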
