# Intermediate Project - SAAS Career Exploration - Part 1

For this project, you will be using the tools that you have learned so far on a real data science problem. You will be cleaning and analyzing a dataset to answer a research question, something that you will be doing for your entire career if you continue down this path.

This project will be done **in groups**. Modern research is collaborative, so get used to it! If you are having trouble finding a partner, please contact your committee director or post on the Career-Exploration Slack channel.

The final product for this project will be a statistical model that answers a question posed about the data, in addition to a short description about how your model works and its limitations.

## 1. Picking a Dataset

You may pick from the following datasets:

### US Census Bureau - 2015 American Community Survey

https://www.kaggle.com/muonneutrino/us-census-demographic-data

This dataset gives demographic information about the US, indexed by census tract. 

You may enjoy this dataset if you:
* are interested in social science
* follow politics
* are a "spatial" thinker

### BoardGameGeek - Board Game Dataset

https://www.kaggle.com/mrpantherson/board-game-data

This dataset gives information on the top 5000 board games, as rated by BoardGameGeek, the gold standard rating site for tabletop gaming enthusiasts.

You may enjoy this dataset if you:
* play board games
* like reviewing things (e.g. Yelp)

The meaning of the columns:
* rank - The overall ranking of the game
* bgg_url - The link to the game's page on the BoardGameGeek website
* game_id - A unique identifier number for the game
* names - The name of the game
* min_players - The minimum number of players required to play the game
* max_players - The maximum number of players that can play the game
* avg_time - The average time the game takes to play
* min_time - The minimum time the game takes to play
* max_time - The maximum time the game takes to play
* year - The year the game was released
* avg_rating - The average rating of the game among all site users
* geek_rating - BoardGameGeek's "Geek Rating", a Bayesian-adjusted average that pulls games with few votes toward a default value
* num_votes - The number of times the game has been rated
* image_url - URL of an image of the game
* age - Minimum recommended age for the game
* mechanic - List of mechanics the game has
* owned - How many people on BoardGameGeek say they have the game
* category - List of genres the game is in
* designer - Who designed the game
* weight - How complex the game is, on a scale of 1-5

### Cortez et al. - Wine Dataset (Vinho Verde)

https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

A dataset recording sensory and chemical attributes of wines, along with ratings given to each of them.

You may enjoy this dataset if you:

* are suspicious of the legitimacy of wine ratings
* are interested in chemistry
* day-drink

<span style="color:blue"> Pick one of these datasets and uncomment the corresponding `pd.read_csv` command.</span>


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import random
import numpy as np
import pandas as pd

# census = pd.read_csv("data/acs2015_census_tract_data.csv")
# games  = pd.read_csv("data/bgg_db_1806.csv")
# wines  = pd.read_csv("data/winequality-red.csv")

## 2. Exploratory Data Analysis (EDA)

When you first start working with a dataset, it is often a good idea to just play around with the data to get an idea of what you're working with. This can help you decide which parts of your data may be more useful than others, and it gives you a basis for sanity checks later on when your code becomes more complicated. For example, in the wines dataset you may see that the "quality" feature only ranges from 3 to 8, so if your model predicts a quality of 73, you know something is wrong!
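A range-based sanity check of this kind could be sketched as follows. The observed values and predictions are made-up numbers for illustration, not taken from the wines dataset or from any real model.

```python
import numpy as np

# Hypothetical values: a few observed quality scores and some model predictions.
observed_quality = np.array([3, 5, 6, 8, 5])
predictions = np.array([5.2, 6.1, 73.0])

# The range seen during EDA serves as a rough plausibility bound.
lowest, highest = observed_quality.min(), observed_quality.max()

# Flag any prediction that falls outside the observed range.
suspicious = predictions[(predictions < lowest) | (predictions > highest)]
print("Predictions outside the observed range:", suspicious)
```

A prediction slightly outside the observed range is not necessarily a bug, but one as far off as 73 is a strong hint that something upstream went wrong.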

### 2.1. - df.head()

One easy way to see what your dataset looks like is by using the `.head()` function, which will display the first five rows of your dataframe. Remember to replace `df` with the name of your dataset!

In [None]:
# Use the .head() function on your dataset
# YOUR CODE HERE



### 2.2. - df.info() and df.describe()

Although the `.head()` function is a convenient way to peek at your dataset, it's important to remember that it only gives you the first few rows. Especially when you're reading data from a poorly formatted CSV file, it's not uncommon for everything to be fine in the first few rows of your data (so that the `.head()` output looks normal) but then at some point everything turns into `NaN`s. To avoid this kind of problem, pandas has other functions that let you look at your data from a broader perspective. The `.info()` function will tell you what kind of data each of your columns contains and how many non-null entries it has, and the `.describe()` function will give you summary statistics for each numeric column.<span style="color:blue"> Try them out!</span>
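To see why `.head()` alone can mislead, here is a small sketch on a made-up dataframe (the name `toy` and its columns are hypothetical, not part of any of the project datasets) whose first five rows are clean while the remaining rows are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first five rows are clean, the rest are missing.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan, np.nan, np.nan],
    "label": ["a", "b", "c", "d", "e", None, None, None],
})

print(toy.head())      # the first five rows show no missing values
toy.info()             # non-null counts reveal only 5 of 8 entries per column
print(toy.describe())  # summary statistics for the numeric column

# Counting missing values per column makes the problem explicit.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Here the `.head()` output looks healthy, while the non-null counts from `.info()` and the `count` row of `.describe()` both show that three rows are missing.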

In [None]:
# Use the .info() function on your dataset
# YOUR CODE HERE



In [None]:
# Use the .describe() function on your dataset
# YOUR CODE HERE



### 2.3. More data-specific analysis

After running the above functions, you have some general information about what your data looks like, so you can start running analyses that are more specific to the dataset you're working with. <span style="color:blue">For these parts, pick two quantitative columns of your data - one that you're interested in predicting, which I will call $y$, and one that you think will be predictive of $y$, which I will call $x$. After you've selected these columns, filter the rows of your data into two subsets - one where the $x$ value is smaller than the mean $x$ value, and one where the $x$ value is larger than the mean $x$ value.</span>
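As a reminder of how row filtering works in pandas, here is a minimal sketch using boolean masks on a made-up dataframe. The columns `hours` and `score` are hypothetical stand-ins for whatever $x$ and $y$ you choose.

```python
import pandas as pd

# Hypothetical toy data standing in for your chosen x ("hours") and y ("score").
toy = pd.DataFrame({"hours": [1, 2, 3, 10], "score": [50, 55, 65, 90]})

# Comparing a column to a number gives a boolean Series (a "mask").
x_mean = toy["hours"].mean()
is_below_mean = toy["hours"] < x_mean

# Indexing the dataframe with a mask keeps only the rows where it is True.
below = toy[is_below_mean]
above = toy[toy["hours"] > x_mean]

print("Mean of x:", x_mean)
print("Rows below the mean:", len(below), "Rows above the mean:", len(above))
```

Note that a row whose $x$ value equals the mean exactly would land in neither subset under these strict comparisons.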

In [None]:
smaller_x_subset = # YOUR CODE HERE
larger_x_subset  = # YOUR CODE HERE

Then, print the mean of $y$ in both of these subsets.

In [None]:
smaller_x_subset_y_mean = # YOUR CODE HERE
larger_x_subset_y_mean  = # YOUR CODE HERE
print("Mean y in smaller x:", smaller_x_subset_y_mean, "Mean y in larger x:", larger_x_subset_y_mean)

Did you find these two means to be far apart? Close together? <span style="color:blue">In words, describe what you found and what you think it shows about the relationship between $x$ and $y$.</span>

YOUR ANSWER HERE

**BONUS**: If we want to try this analysis with a different pair of columns for $x$ and $y$, it would be good coding practice to define a function for this task instead of copying our previous code and changing the column names. <span style="color:blue"> Try making a function that takes in the names of two columns for $x$ and $y$ and performs the above operations. </span>

In [None]:
def print_subsetted_y_means(df,x_name,y_name):
    """
    Prints the mean of y where x is smaller than the mean of x, and where x is larger than the mean of x.
    
    df: The dataframe containing our data.
    x_name: The name of the x column.
    y_name: The name of the y column.
    """
    
    # YOUR CODE HERE

### 2.4. Creating new features

When we get to modeling later in the semester, we often won't want to use the data exactly as it was given to us; instead, we can transform the data into new **features** that we think may be more useful. For example, suppose we thought that $x^2$ would be more useful than $x$ itself for predicting $y$. <span style="color:blue">Add a new column to your dataframe that contains the values of $x^2$.</span>
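Assigning to a column name that does not exist yet creates a new column. A minimal sketch on made-up data (the column names `hours` and `hours_squared` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; "hours" plays the role of x.
toy = pd.DataFrame({"hours": [1, 2, 3, 10]})

# Arithmetic on a column is applied element-wise, so ** 2 squares every value.
toy["hours_squared"] = toy["hours"] ** 2
print(toy)
```

The original `hours` column is left untouched, so both the raw value and the transformed feature remain available.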

In [None]:
# YOUR CODE HERE



In addition to polynomial features like $x^2$ above, another common kind of feature is an **indicator variable**, which is equal to 1 if some condition is true and 0 otherwise. For example, maybe we now think that the actual value of $x$ doesn't matter; all that's important is whether or not $x$ is greater than its mean. <span style="color:blue"> Add another column to your data that is equal to 1 if $x$ is greater than its mean, and 0 otherwise. </span>
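One way to build an indicator variable in pandas is to convert a boolean condition to integers. The sketch below uses made-up data and an arbitrary fixed threshold of 5; both the column names and the threshold are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; "hours" plays the role of x.
toy = pd.DataFrame({"hours": [1, 2, 3, 10]})

# The comparison yields True/False; astype(int) maps True to 1 and False to 0.
toy["long_session"] = (toy["hours"] > 5).astype(int)
print(toy)
```

The same pattern works with any condition that produces a boolean Series, such as a comparison against a column's mean.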

In [None]:
# YOUR CODE HERE



Congratulations, you're now finished with the first checkpoint!

## 3. Submission

**To submit, first save this file as a PDF by going to the top left and clicking File -> Download as -> PDF via LaTeX (.pdf). Then, send the PDF to Prince on Slack!**