## Background / Motivation

Throughout the STAT 303-2 class, we explored several modelling questions related to big-picture topics such as housing and bank finance. However, for our final project, we were interested in looking at how regression analysis could apply to questions that come up in our daily lives and conversations. For example, what do you think is going to happen next in your favorite book? 

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

We created a model to see whether we could accurately predict which Game of Thrones characters were dead or alive by the end of the fifth book. The objective was prediction. If the model predicts survival accurately, its strongest predictors also suggest what influences a character's chances of survival.

In [None]:
# can we change this a bit? we don't address inference much - might be better/safer to go with prediction

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

We used character-predictions.csv from the Game of Thrones dataset on Kaggle linked below. The dataframe contains 1946 observations, but after dropping rows with NaN values in the columns we used as predictors, 493 characters with no missing data remained. The outcome we tried to predict was the column isAlive. The predictors we used were house, culture, male, and isNoble.

https://www.kaggle.com/datasets/mylesoneill/game-of-thrones?select=battles.csv

## Stakeholders
Who cares? If you are successful, what difference will it make to them?

For the past decade, Game of Thrones readers have been waiting for the next installment in the series to be published. The most recent book ended with several cliffhangers, so fans have been left speculating what will happen next. A model with insight into what makes a person in the GOT universe more likely to die would be interesting to readers concerned about their favorite characters. With the presence of fandom communities online, especially for blockbusters such as Game of Thrones, there would be a strong appetite for content delving into the world. Although a regression model about the series would not be the same as George R. R. Martin's next novel, a nerdy analysis about the books could help tide fans over until the release date finally arrives. 

## Data quality check / cleaning / preparation

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level; you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

In [11]:
import pandas as pd
characters = pd.read_csv("data/character-predictions_pose.csv")
display(characters.describe())

Unnamed: 0,S.No,plod,male,dateOfBirth,DateoFdeath,book1,book2,book3,book4,book5,...,isAliveHeir,isAliveSpouse,isMarried,isNoble,age,numDeadRelations,boolDeadRelations,isPopular,popularity,isAlive
count,1946.0,1946.0,1946.0,433.0,444.0,1946.0,1946.0,1946.0,1946.0,1946.0,...,23.0,276.0,1946.0,1946.0,433.0,1946.0,1946.0,1946.0,1946.0,1946.0
mean,973.5,0.36553,0.619219,1577.364896,2950.193694,0.198356,0.374615,0.480473,0.591984,0.39517,...,0.652174,0.778986,0.141829,0.460946,-1293.56351,0.305755,0.074512,0.059096,0.089584,0.745632
std,561.906131,0.312637,0.485704,19565.41446,28192.245529,0.398864,0.484148,0.499747,0.491593,0.489013,...,0.486985,0.415684,0.348965,0.498601,19564.340993,1.38391,0.262669,0.235864,0.160568,0.435617
min,1.0,0.0,0.0,-28.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-298001.0,0.0,0.0,0.0,0.0,0.0
25%,487.25,0.101,0.0,240.0,282.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,18.0,0.0,0.0,0.0,0.013378,0.0
50%,973.5,0.2645,1.0,268.0,299.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,0.0,0.0,27.0,0.0,0.0,0.0,0.033445,1.0
75%,1459.75,0.60875,1.0,285.0,299.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,50.0,0.0,0.0,0.0,0.086957,1.0
max,1946.0,1.0,1.0,298299.0,298299.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,100.0,15.0,1.0,1.0,1.0,1.0


In [12]:
characters.house.value_counts()

Night's Watch      105
House Frey          97
House Stark         72
House Targaryen     62
House Lannister     49
                  ... 
House Gower          1
House Borrell        1
Citadel              1
Wise Masters         1
Three-eyed crow      1
Name: house, Length: 347, dtype: int64

In [14]:
characters.house.unique()

array([nan, 'House Frey', 'House Swyft', 'House Arryn', 'House Santagar',
       'House Targaryen', 'House Osgrey', "Night's Watch", 'House Humble',
       'House Wylde', 'House Wode', 'House Fell',
       'Brotherhood Without Banners', 'House Webber', 'House Greyjoy',
       'House Stark', 'House Waynwood', 'House Dayne', 'House Manderly',
       'House Farwynd of the Lonely Light', 'Happy Port',
       'House of Loraq', 'Kingswood Brotherhood', 'House Botley',
       'Burned Men', 'House Velaryon', 'House Tallhart', 'House Tyrell',
       'House Blackwood', 'House Blackfyre', 'wildling',
       'Kingdom of the Three Daughters',
       'House Royce of the Gates of the Moon', 'House Nayland',
       "House Vance of Wayfarer's Rest", 'House Rowan', 'House Farrow',
       'House Lonmouth', 'House Reyne', 'House Ashford', 'House Brax',
       'House Paege', 'House Hollard', 'House Tarth', 'House Ryswell',
       'House Lannister', 'House Crakehall', 'House Darklyn',
       'House Westerli

In [13]:
characters.culture.value_counts()

Northmen     124
Ironborn     112
Free Folk     51
Valyrian      43
Braavosi      42
            ... 
Andal          1
Norvoshi       1
Qarth          1
Lhazarene      1
The Reach      1
Name: culture, Length: 64, dtype: int64

In [15]:
characters.culture.unique()

array([nan, 'Rivermen', 'Dornish', 'Valyrian', 'Ironborn', 'Free Folk',
       'Northmen', 'Summer Isles', 'Braavosi', 'Dothraki', 'Ghiscari',
       'Vale mountain clans', 'Reach', 'Tyroshi', 'Lhazarene',
       'Free folk', 'Ironmen', 'Qartheen', 'Lysene', 'westermen',
       'Westerman', 'Qarth', 'Lyseni', 'northmen', 'Qohor', 'Westeros',
       'Norvoshi', 'First Men', 'Meereenese', 'Andal', 'Astapori',
       'Westermen', 'ironborn', 'Ghiscaricari', 'Braavos', 'Stormlands',
       'Valemen', 'Myrish', 'Lhazareen', 'Dornishmen', 'Sistermen',
       'Northern mountain clans', 'Andals', 'Vale', 'Crannogmen',
       'Wildling', 'Dorne', 'Pentoshi', 'free folk', 'Summer Islander',
       'Westerlands', 'Summer Islands', 'Asshai', 'Riverlands', 'Naathi',
       'Rhoynar', 'Meereen', 'Norvos', 'Stormlander', 'Wildlings',
       'Astapor', 'Reachmen', "Asshai'i", 'Ibbenese', 'The Reach'],
      dtype=object)

Based on these distributions, we kept the columns that we felt would be relevant to the analysis.

In [16]:
characters = characters.loc[:,['name', 'male', 'house', 'isNoble', 'numDeadRelations', 'popularity', 'isAlive', 'culture', 'boolDeadRelations', 'isPopular']]

Then, we deleted observations with any missing values for any of the selected variables.

In [17]:
characters = characters[~characters.isnull().any(axis=1)]
characters.reset_index(inplace = True, drop = True)

This left us with 493 observations.

In [18]:
characters

Unnamed: 0,name,male,house,isNoble,numDeadRelations,popularity,isAlive,culture,boolDeadRelations,isPopular
0,Walder Frey,1,House Frey,1,1,0.896321,1,Rivermen,1,1
1,Sylva Santagar,0,House Santagar,1,0,0.043478,1,Dornish,0,0
2,Valarr Targaryen,1,House Targaryen,1,0,0.431438,0,Valyrian,0,1
3,Will Humble,1,House Humble,0,0,0.013378,1,Ironborn,0,0
4,Wulfe,1,House Greyjoy,0,0,0.023411,1,Ironborn,0,0
...,...,...,...,...,...,...,...,...,...,...
488,Tarle,1,Drowned men,0,0,0.026756,1,Ironborn,0,0
489,Gormond Goodbrother,1,House Goodbrother,0,0,0.040134,1,Ironborn,0,0
490,Walder Rivers,1,House Frey,1,0,0.080268,1,Rivermen,0,0
491,Laena Velaryon,0,House Velaryon,0,0,0.140468,0,Valyrian,0,0


Since house and culture each had many distinct values, some with only a single character, we binned each house and culture by survival rate to increase their predictive ability.

In [19]:
# Survival rate of each character's house, attached to every row
characters['house_alive'] = characters.groupby('house').isAlive.transform('mean')

# Bin house_alive into up to 10 quantile bins (duplicate bin edges are dropped)
binned_house_alive = pd.qcut(characters['house_alive'], 10, retbins=True, duplicates='drop')
bins = binned_house_alive[1]
# include_lowest=True keeps the lowest survival rate inside the first bin instead of NaN
characters['house_alive_binned'] = pd.cut(characters['house_alive'], bins=bins, include_lowest=True)
# One dummy per bin; the first bin is dropped and serves as the baseline
dum = pd.get_dummies(characters.house_alive_binned, drop_first=True)
dum.columns = ['house_alive' + str(x) for x in range(1, len(bins) - 1)]
characters = pd.concat([characters, dum], axis=1)
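
The house binning can be checked on a small example. The following is a minimal sketch on a made-up frame (the house labels `A` to `D` and their outcomes are hypothetical, not taken from the dataset). It shows the per-house survival rate, the quantile edges from `pd.qcut`, and how `pd.cut` treats rows that sit exactly on the lowest edge.

```python
import pandas as pd

# Hypothetical toy data; house names and outcomes are made up for illustration.
toy = pd.DataFrame({
    "house":   ["A", "A", "B", "B", "B", "C", "C", "D"],
    "isAlive": [1,   0,   1,   1,   1,   0,   0,   1],
})

# Survival rate of each character's house, broadcast back to every row.
toy["house_alive"] = toy.groupby("house")["isAlive"].transform("mean")

# Quantile edges; duplicate edges are dropped, so fewer than q bins may remain.
_, edges = pd.qcut(toy["house_alive"], q=4, retbins=True, duplicates="drop")

# By default pd.cut uses half-open bins (left, right], so rows equal to the
# lowest edge fall outside every bin and become NaN.
n_lost = pd.cut(toy["house_alive"], bins=edges).isna().sum()

# include_lowest=True closes the first bin on the left and keeps those rows.
toy["house_alive_binned"] = pd.cut(toy["house_alive"], bins=edges, include_lowest=True)

print(toy[["house", "house_alive", "house_alive_binned"]])
print("rows lost without include_lowest:", n_lost)
```

Because the first bin is the one removed by `drop_first=True`, rows that fall outside every bin and rows in the first bin both end up with all-zero dummies, so the distinction mainly matters when the binned column itself is inspected.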

For cultures, there were many names that corresponded to the same culture. First, these were combined.  

In [20]:
# Map spelling variants and place names to one canonical label per culture
culture_aliases = {
    "northmen": "Northmen",
    "ironborn": "Ironborn", "Ironmen": "Ironborn",
    "Asshai'i": "Asshai",
    "Free folk": "Free Folk", "free folk": "Free Folk",
    "Summer Islands": "Summer Isles", "Summer Islander": "Summer Isles",
    "westermen": "Westermen", "Westerman": "Westermen", "Westerlands": "Westermen",
    "Vale": "Valemen",
    "Lhazareen": "Lhazarene",
    "The Reach": "Reach", "Reachmen": "Reach",
    "Qarth": "Qartheen",
    "Lyseni": "Lysene",
    "Stormlander": "Stormlands",
    "Meereenese": "Meereen",
    "Astapor": "Astapori",
    "Norvos": "Norvoshi",
    "Wildlings": "Wildling",
    "Andals": "Andal",
    "Braavos": "Braavosi",
    "Dorne": "Dornish", "Dornishmen": "Dornish",
    "Ghiscaricari": "Ghiscari",
}
characters.culture = characters.culture.replace(culture_aliases)

Then, we found survival rates per culture and binned them accordingly, similarly to how houses were binned. 

In [24]:
culture_counts = pd.DataFrame(characters.culture.value_counts()).reset_index()
survival_counts = pd.DataFrame(characters.groupby(['culture']).isAlive.value_counts())
survival_counts.rename({'isAlive':'count_survived'}, axis='columns', inplace=True)
survival_counts = pd.DataFrame(survival_counts.to_records())
# For Qohor and Astapori no one survived, so the filter below drops them for now; they are binned as none_survive later
survival_counts = survival_counts[survival_counts['isAlive'] == 1] 
survival_counts.drop(columns=["isAlive"], inplace=True)
culture_counts = culture_counts.rename(columns={'index':'culture', "culture" : "total"})
survival_df = culture_counts.merge(survival_counts)
survival_df['percent_survived'] = (survival_df['count_survived']/survival_df['total'])*100
survival_df = survival_df.sort_values(by=['percent_survived'], ascending=False)
characters.culture = characters.culture.replace(to_replace = ['Ibbenese', 'Asshai', 'Lhazarene', 'Summer Isles', 'First Men', 'Naathi', 'Norvoshi', 'Rhoynar', 'Crannogmen'], value = "all_survive")
characters.culture = characters.culture.replace(to_replace = ['Ironborn', 'Ghiscari', 'Vale mountain clans', 'Dornish', 'Reach'], value = "most_survive")
characters.culture = characters.culture.replace(to_replace = ['Dothraki', 'Stormlands', 'Rivermen', 'Braavosi', 'Northmen'], value = "many_survive")
characters.culture = characters.culture.replace(to_replace = ['Qartheen', 'Myrish', 'Lysene', 'Valemen', 'Northern mountain clans', 'Tyroshi', 'Westeros'], value = "morethanhalf_survive")
characters.culture = characters.culture.replace(to_replace = ['Westermen', 'Riverlands', 'Pentoshi', 'Free Folk', 'Sistermen', 'Meereen'], value = "half_survive")
# 'Riverlands' and 'Lysene' were already binned above, so only these two cultures remain for this bin
characters.culture = characters.culture.replace(to_replace = ['Wildling', 'Valyrian'], value = "few_survive")
characters.culture = characters.culture.replace(to_replace = ['Astapori', 'Qohor'], value = "none_survive")
culture_counts = pd.DataFrame(characters.culture.value_counts()).reset_index()
culture_counts = culture_counts.rename(columns={'index':'culture_bin', "culture" : "count"})

Unnamed: 0,culture_bin,count
0,most_survive,186
1,many_survive,167
2,morethanhalf_survive,49
3,few_survive,44
4,half_survive,29
5,all_survive,15
6,none_survive,3
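
The per-culture survival table can also be built in one step. The sketch below is an illustration on made-up rows (the outcomes are hypothetical), reusing the column names `total`, `count_survived` and `percent_survived` from the cell above. Unlike the filter-and-merge approach, a grouped mean keeps cultures with no survivors in the table at 0%.

```python
import pandas as pd

# Hypothetical toy data; culture labels and outcomes are illustrative only.
toy = pd.DataFrame({
    "culture": ["Northmen", "Northmen", "Ironborn", "Qohor", "Ironborn", "Northmen"],
    "isAlive": [1, 0, 1, 0, 1, 1],
})

# The mean of a 0/1 column is the share of survivors; count gives the group size.
survival = (
    toy.groupby("culture")["isAlive"]
       .agg(total="count", count_survived="sum", percent_survived="mean")
)
survival["percent_survived"] *= 100
survival = survival.sort_values("percent_survived", ascending=False)
print(survival)
```

The sorted `percent_survived` column is what the hand-picked bins such as `most_survive` and `few_survive` are based on.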


## Exploratory data analysis -- not answered

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

In [2]:
# I feel like the corr graph was not helpful?? what do you guys think

In [None]:
# maybe stuff like value counts? one of the things that was important was the value counts on who was alive and who was dead

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

We used a logistic regression model and optimized classification accuracy. We chose overall classification accuracy rather than recall or precision because we were interested in correctly predicting the outcomes of as many characters as possible. At first, we approached the problem with a linear regression model because that was what we were familiar with from the course material. However, after the class covered logistic regression, we realized that a logistic model is the appropriate choice for predicting the binary outcome isAlive.
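
To make the difference between these metrics concrete, here is a minimal sketch that computes accuracy, precision and recall from predicted survival probabilities. The function name `classification_metrics`, the 0.5 threshold and the numbers are assumptions for illustration; they are not results from our model.

```python
import numpy as np

def classification_metrics(y_true, p_hat, threshold=0.5):
    """Accuracy, precision and recall when predicting 1 if p_hat >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_hat) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted alive, is alive
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted dead, is dead
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted alive, is dead
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted dead, is alive
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall

# Made-up labels (1 = alive) and predicted survival probabilities.
y = [1, 1, 1, 0, 0, 1]
p = [0.9, 0.6, 0.4, 0.7, 0.2, 0.8]
accuracy, precision, recall = classification_metrics(y, p)
print(accuracy, precision, recall)
```

Accuracy counts both kinds of correct prediction (alive and dead) equally, which matches the goal of getting as many characters right as possible.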

Others have modeled GOT characters' survival before, but the two models we saw were distinct from ours. The first, by students at Olin College, predicted characters' chances of survival in upcoming books (http://allendowney.blogspot.com/2015/03/bayesian-survival-analysis-for-game-of.html). The second also appears to model whether characters have already survived or died, but it reports a percentage likelihood of survival instead of a binary result. Using machine learning, its authors obtained a final validation accuracy of 89.92%.

## Developing the model -- not answered maybe do this together thurs

Explain the steps taken to develop and improve the base model - informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about decision you made, why you made them, and any evidence/experimentation to back them up.**

We used forward model selection to select the best model for prediction. We chose this specifically because we had too many predictors for best subset selection to be feasible with time/processor constraints. Our algorithm for forward selection differed slightly from the one covered in class: whereas the class version used pseudo-$R^2$ to select the best model, ours used accuracy computed with leave-one-out cross-validation. Overall, we were able to obtain a model accuracy of ____%.
We chose not to address multicollinearity, since our problem is primarily a prediction problem, and multicollinearity tends not to interfere with prediction. We found that variable transformations, apart from the binning of house and culture, were not necessary, as our EDA did not show any predictor with a clearly non-linear relationship with the outcome. TALK ABOUT OUTLIERS AND INFLUENTIAL POINTS, DECISION THRESHOLD, ETC
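
The idea of forward selection scored by leave-one-out accuracy can be sketched as follows. This is an illustration under stated assumptions, not our project code: the logistic fit is a small hand-written Newton solver with a ridge term added for numerical stability, selection stops when no candidate improves the score, the threshold is 0.5, and the data are synthetic. The helper names `fit_logistic`, `loocv_accuracy` and `forward_select` are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, ridge=1.0, n_iter=25):
    """Logistic regression by Newton's method; X must include an intercept column.
    The ridge penalty is an assumption added to keep the fit stable."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p) - ridge * beta
        hess = (X * (p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def loocv_accuracy(X, y, cols):
    """Share of held-out rows classified correctly at a 0.5 threshold."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    hits = 0
    for i in range(n):
        train = np.arange(n) != i          # leave row i out
        beta = fit_logistic(Xd[train], y[train])
        hits += int((Xd[i] @ beta >= 0) == (y[i] == 1))
    return hits / n

def forward_select(X, y):
    """Greedily add the predictor that most improves LOOCV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    best = loocv_accuracy(X, y, selected)  # intercept-only baseline
    while remaining:
        scores = {j: loocv_accuracy(X, y, selected + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:         # stop when nothing improves
            break
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best

# Synthetic data: only column 0 carries signal; columns 1 and 2 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=60) > 0).astype(int)
selected, acc = forward_select(X, y)
print(selected, round(acc, 3))
```

Each candidate model is refit once per observation, so the cost grows with both the number of rows and the number of predictors, which is why a greedy search is more practical here than best subset selection.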

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

The inferences made from our model will be affected as new GOT books are released. Character information is subject to change, and the data we trained our model on will become outdated as the story continues. Specifically, predictors like isMarried or isNoble can change for any character who is still living. As more books come out and more characters are introduced, our model may become less useful than models trained on more recent data.
As more characters die, we may find that the predictors we used are less significant, or possibly not significant at all. Therefore, we expect our model to hold until the sixth book is released. After that, we assume the model will still be useful, but potentially less so.

## Conclusions and Recommendations to stakeholder(s) -- not answered

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

## GitHub and individual contribution {-} -- not answered need to fill out github section

**Github link** for the project repository.

https://github.com/cbugayer/RegressionObsession

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Divya Bhardwa</td>
    <td>Data cleaning and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Charles Bugayer</td>
    <td>Assumptions and interactions</td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Cat Tawadros</td>
    <td>Cross validation/testing implementation, model selection algorithm development, outlier treatment, decision threshold determination</td>
    <td>Identified outliers/influential points and analyzed their effect on the model.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Annie Xia </td>
    <td>Variable selection and addressing overfitting</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

GitHub proved quite useful in most scenarios; however, it was a bit difficult to navigate when dealing with merge conflicts. Because of this, we began working in separate files, which led to some disorganization in the early stages of the project.

## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.