## Background / Motivation

Throughout the STAT 303-2 class, we explored several modelling questions related to big-picture topics such as housing and bank finance. However, for our final project, we were interested in looking at how regression analysis could apply to questions that come up in our daily lives and conversations. For example, what do you think is going to happen next in your favorite book? 

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

We built a model to see whether we could accurately predict if Game of Thrones characters were dead or alive by the end of the fifth book. The objective was inference: if a model predicts survival accurately, its strongest predictors indicate what influences a character's chances of survival.

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

We used character-predictions.csv from the Game of Thrones dataset on Kaggle, linked below. The dataframe contains 1946 observations; after filtering out rows with NaN values in the columns we used as predictors, 493 characters with no missing data remained. The outcome we tried to predict was the isAlive column. The predictors were house, culture, male, and isNoble.

https://www.kaggle.com/datasets/mylesoneill/game-of-thrones?select=battles.csv
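The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `keep_complete_cases` and the tiny sample table are hypothetical, and it assumes the column names match those listed in this section.

```python
import io

import pandas as pd

# Columns named in the report: the outcome and the four predictors.
OUTCOME = "isAlive"
PREDICTORS = ["house", "culture", "male", "isNoble"]


def keep_complete_cases(characters: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows with no missing value in any predictor or the outcome."""
    columns = PREDICTORS + [OUTCOME]
    return characters.dropna(subset=columns)[columns].reset_index(drop=True)


# Tiny made-up sample standing in for character-predictions.csv (not real data).
sample_csv = io.StringIO(
    "name,house,culture,male,isNoble,isAlive\n"
    "A,House X,Northmen,1,1,0\n"
    "B,,Northmen,0,0,1\n"
    "C,House Y,,1,0,1\n"
    "D,House Y,Dornish,0,1,1\n"
)
sample = pd.read_csv(sample_csv)  # empty fields are read as NaN
complete = keep_complete_cases(sample)
print(len(sample), "rows before,", len(complete), "rows after")
```

With the real file, `sample` would instead come from something like `pd.read_csv("character-predictions.csv")`, assuming the file sits in the working directory.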

## Stakeholders
Who cares? If you are successful, what difference will it make to them?

For the past decade, Game of Thrones readers have been waiting for the next installment in the series to be published. The most recent book ended with several cliffhangers, so fans have been left speculating about what will happen next. A model with insight into what makes a person in the GOT universe more likely to die would be interesting to readers concerned about their favorite characters. With the presence of fandom communities online, especially for blockbusters such as Game of Thrones, there would be a strong appetite for content delving into the world. Although a regression model about the series would not be the same as George R. R. Martin's next novel, a nerdy analysis of the books could help tide fans over until the release date finally arrives.

## Data quality check / cleaning / preparation -- not answered

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level; you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

## Exploratory data analysis -- not answered

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

We used a logistic regression model and optimized classification accuracy. We chose to optimize classification accuracy as a whole and not recall or precision because we were interested in correctly predicting the outcomes of as many characters as possible. At first, we approached the problem with a linear regression model because that was what we were familiar with from the course material. However, after the class covered logistic regression, we realized logistic modelling would be appropriate for predicting the binary measure isAlive.
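To make the choice of metric concrete, here is a small sketch of how classification accuracy differs from precision and recall for a binary outcome such as isAlive. The function name, the 0.5 cutoff on predicted probabilities, and the example numbers are all assumptions for illustration; they are not taken from our fitted model.

```python
def classification_metrics(actual, predicted_probability, threshold=0.5):
    """Compute accuracy, precision, and recall for a binary outcome.

    `actual` holds 0/1 labels (1 = alive); `predicted_probability` holds the
    model's estimated probability of 1. The 0.5 threshold is an assumption.
    """
    predicted = [1 if p >= threshold else 0 for p in predicted_probability]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # alive, predicted alive
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # dead, predicted dead
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # dead, predicted alive
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # alive, predicted dead
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels and probabilities, for illustration only.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]
metrics = classification_metrics(actual, probs)
print(metrics)
```

Accuracy counts correct predictions for both living and dead characters, whereas precision and recall focus only on the characters predicted or observed to be alive. That is why accuracy matched our goal of getting as many characters right as possible.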

Others have modeled the survival of GOT characters before, but the two models we found differ from ours. The first, by students at Olin College, predicted characters' chances of survival in upcoming books (http://allendowney.blogspot.com/2015/03/bayesian-survival-analysis-for-game-of.html). The second also appears to predict whether characters have already survived or died, but it reports a percentage likelihood of survival instead of a binary result. Using machine learning, its authors obtained a final validation accuracy of 89.92%.

## Developing the model -- not answered maybe do this together thurs

Explain the steps taken to develop and improve the base model - informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about decision you made, why you made them, and any evidence/experimentation to back them up.**

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

The inferences made from our model will be affected as new GOT books are released. As more characters die, the predictors we used might become less significant, or possibly not significant at all. We therefore expect our model to hold only until the sixth book is released. After that, the model may still be informative, but this is not guaranteed. The model will become obsolete once interest in GOT fades away.

## Conclusions and Recommendations to stakeholder(s) -- not answered

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

## GitHub and individual contribution {-} -- not answered need to fill out github section

**Github link** for the project repository.

https://github.com/cbugayer/RegressionObsession

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Divya Bhardwa</td>
    <td>Data cleaning and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Charles Bugayer</td>
    <td>Assumptions and interactions</td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Cat Tawadros</td>
    <td>Outlier and influential points treatment</td>
    <td>Identified outliers/influential points and analyzed their effect on the model.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Annie Xia </td>
    <td>Variable selection and addressing overfitting</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.