# Checkpoint 1

Reminder:

- You are being evaluated for completion and effort in this checkpoint.
- Avoid manual labor and hard coding as much as possible. Everything we've taught you so far is meant to simplify and automate your process.

We will be working with the same `states_edu.csv` that you should already be familiar with from the tutorial.

We investigated Grade 8 reading score in the tutorial. For this checkpoint, you are asked to investigate another test. Here's an overview:

* Choose a specific response variable to focus on
>Grade 4 Math, Grade 4 Reading, Grade 8 Math
* Pick or create features to use
>Will all the features be useful in predicting test score? Are some more important than others? Should you standardize, bin, or scale the data?
* Explore the data as it relates to that test
>Create at least 2 visualizations (graphs), each with a caption describing the graph and what it tells us about the data
* Create training and testing data
>Do you want to train on all the data? Only data from the last 10 years? Only Michigan data?
* Train a ML model to predict outcome
>Define what you want to predict, and pick a model in sklearn to use (see sklearn <a href="https://scikit-learn.org/stable/modules/linear_model.html">regressors</a>).


Include comments throughout your code! Every cleanup and preprocessing task should be documented.

<h2> Data Cleanup </h2>

Import `numpy`, `pandas`, and `matplotlib`.

(Feel free to import other libraries!)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load in the "states_edu.csv" dataset and take a look at the head of the data

In [None]:
df = pd.read_csv("states_edu.csv")
df.head()

You should always familiarize yourself with what each column in the dataframe represents. Read about the states_edu dataset here: https://www.kaggle.com/noriuk/us-education-datasets-unification-project

Use this space to rename columns, deal with missing data, etc. _(optional)_

1.   Renaming columns for consistency
2.   Handling missing data



In [None]:
# Renaming columns to lowercase and replacing spaces with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

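# Filling missing values in important columns with the median
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
df['avg_math_4_score'] = df['avg_math_4_score'].fillna(df['avg_math_4_score'].median())
df['avg_math_8_score'] = df['avg_math_8_score'].fillna(df['avg_math_8_score'].median())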
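Before filling anything, it can help to see how much data is actually missing, and to consider dropping rows instead of imputing them. The sketch below uses a made-up three-row frame (`toy`, with invented values, not rows from `states_edu.csv`) to show both checks. Whether filling or dropping suits the real dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "state": ["MICHIGAN", "OHIO", "TEXAS"],
    "avg_math_4_score": [230.0, np.nan, 240.0],
    "enroll": [np.nan, 1000.0, 2000.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Alternative to filling: keep only rows where the response variable is present
dropped = toy.dropna(subset=["avg_math_4_score"])
print(dropped)
```

Filling a response variable with its median makes those rows look like real observations in later steps, which is why dropping them is a common alternative.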


<h2>Exploratory Data Analysis (EDA) </h2>

Choose one of Grade 4 Reading, Grade 4 Math, or Grade 8 Math to focus on: Grade 4 Math

How many years of data are logged in our dataset?

In [None]:
# Focus on Grade 4 Math and check how many unique years of data we have
grade_4_math_data = df[~df['avg_math_4_score'].isnull()]

# Get the number of unique years
unique_years = grade_4_math_data['year'].nunique()

# Display the result
print(f"There are {unique_years} unique years of data for Grade 4 Math.")


Let's compare Michigan to Ohio. Which state has the higher average across all years in the test you chose?

In [None]:
# Focusing on Grade 4 Math and comparing Michigan to Ohio
grade_4_math_data = df[~df['avg_math_4_score'].isnull()]

# Filter for Michigan and Ohio
michigan_data = grade_4_math_data[grade_4_math_data['state'] == 'MICHIGAN']
ohio_data = grade_4_math_data[grade_4_math_data['state'] == 'OHIO']

# Calculate the average for each state
michigan_avg = michigan_data['avg_math_4_score'].mean()
ohio_avg = ohio_data['avg_math_4_score'].mean()

# Display the results
michigan_avg, ohio_avg


Find the average for your chosen test across all states in 2019

In [None]:
# Focusing on Grade 4 Math and filtering for the year 2019
grade_4_math_2019 = df[(df['year'] == 2019) & (~df['avg_math_4_score'].isnull())]

# Calculate the average score for Grade 4 Math across all states in 2019
average_2019_grade_4_math = grade_4_math_2019['avg_math_4_score'].mean()

# Display the result
print(f"The average Grade 4 Math score across all states in 2019 is {average_2019_grade_4_math:.2f}")


For each state, find a maximum value for your chosen test score

*   Find the max value for Grade 4 math for each state
*   Find the max value for Grade 8 math for each state



In [None]:
# Step 1: Find the maximum Grade 4 Math score for each state
max_grade_4_math_per_state = df.groupby('state')['avg_math_4_score'].max()

# Step 2: Find the maximum Grade 8 Math score for each state
max_grade_8_math_per_state = df.groupby('state')['avg_math_8_score'].max()

# Display both results
print("Maximum Grade 4 Math score per state:")
print(max_grade_4_math_per_state)

print("\nMaximum Grade 8 Math score per state:")
print(max_grade_8_math_per_state)


Refer to the `Grouping and Aggregating` section in Tutorial 0 if you are stuck.
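As a small illustration of grouping and aggregating, the sketch below computes several statistics per state in one call with `agg`. The frame `toy` and its scores are invented for the example and are not taken from `states_edu.csv`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "state": ["MICHIGAN", "MICHIGAN", "OHIO", "OHIO"],
    "year": [2013, 2015, 2013, 2015],
    "avg_math_4_score": [237.0, 236.0, 246.0, 244.0],
})

# Group rows by state, then compute the mean and max score within each group
summary = toy.groupby("state")["avg_math_4_score"].agg(["mean", "max"])
print(summary)
```

The result is a DataFrame indexed by state with one column per statistic, so a single value can be looked up with `summary.loc["OHIO", "max"]`.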

<h2> Feature Engineering </h2>

After exploring the data, you can choose to modify features that you would use to predict the performance of the students on your chosen response variable.

You can also create your own features. For example, if you suspect that a state's expenditure per student affects overall academic performance, you could create an `expenditure_per_student` feature.

Use this space to modify or create features.

In [None]:
# Leave missing values as NaN rather than filling them:
# filling total_expenditure with 0 or enroll with 1 would produce misleading ratios,
# while a ratio with a missing input simply stays NaN and can be dropped before modeling.

# Step 1: Create expenditure_per_student feature
df['expenditure_per_student'] = df['total_expenditure'] / df['enroll']

# Step 2: Justify the feature creation
# This feature is created to capture how much money is spent on each student, which could have an impact on student performance.

# Display the new column in the dataframe
df[['state', 'year', 'expenditure_per_student']].head()


Feature engineering justification: **One feature is expenditure per student, because we may want to see how investment in education affects overall performance.**
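The overview also asks whether features should be standardized, binned, or scaled. A minimal sketch of the first two, assuming a numeric column like `expenditure_per_student`, is shown below. The values in `toy` and the derived column names `expenditure_z` and `expenditure_bin` are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "expenditure_per_student": [8.0, 10.0, 12.0, 14.0],
    "avg_math_4_score": [230.0, 235.0, 240.0, 245.0],
})

col = toy["expenditure_per_student"]

# Standardize: subtract the mean and divide by the population standard deviation
toy["expenditure_z"] = (col - col.mean()) / col.std(ddof=0)

# Bin: split the same column into two equal-width bins with readable labels
toy["expenditure_bin"] = pd.cut(col, bins=2, labels=["low", "high"])

print(toy)
```

Standardizing puts features measured in very different units (dollars per student versus total federal revenue, for instance) on a comparable scale, while binning trades precision for a simpler categorical view.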

<h2>Visualization</h2>

Investigate the relationship between your chosen response variable and at least two predictors using visualizations. Write down your observations.

**Visualization 1**

In [None]:
import matplotlib.pyplot as plt

# Scatter plot of Grade 4 Math score vs expenditure_per_student
plt.figure(figsize=(10, 6))
plt.scatter(df['expenditure_per_student'], df['avg_math_4_score'], color='blue', alpha=0.5)
plt.title('Grade 4 Math Score vs Expenditure Per Student')
plt.xlabel('Expenditure Per Student')
plt.ylabel('Grade 4 Math Score')
plt.grid(True)
plt.show()


**Grade 4 Math score vs. expenditure per student**

**Visualization 2**

In [None]:
# Scatter plot of Grade 4 Math score vs federal_revenue
plt.figure(figsize=(10, 6))
plt.scatter(df['federal_revenue'], df['avg_math_4_score'], color='green', alpha=0.5)
plt.title('Grade 4 Math Score vs Federal Revenue')
plt.xlabel('Federal Revenue')
plt.ylabel('Grade 4 Math Score')
plt.grid(True)
plt.show()


**Grade 4 Math score vs. federal revenue**

<h2> Data Creation </h2>

_Use this space to create train/test data_

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
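# Drop rows with a missing predictor or response, since LinearRegression cannot fit on NaN values
model_df = df[['expenditure_per_student', 'federal_revenue', 'avg_math_4_score']].dropna()
X = model_df[['expenditure_per_student', 'federal_revenue']]
y = model_df['avg_math_4_score']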

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
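The overview asks whether to train on all the data, only the last 10 years, or only Michigan. One way to try those options is to filter the frame before building `X` and `y`. The sketch below does this on an invented frame `toy` (the values are not from `states_edu.csv`); the same filters could be applied to `df` before calling `train_test_split`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "state": ["MICHIGAN", "MICHIGAN", "OHIO", "OHIO"],
    "year": [2003, 2015, 2003, 2015],
    "avg_math_4_score": [236.0, 236.0, 238.0, 244.0],
})

# Keep only the most recent 10 years present in the data (here 2006 through 2015)
recent = toy[toy["year"] >= toy["year"].max() - 9]

# Keep only rows for a single state
michigan_only = toy[toy["state"] == "MICHIGAN"]

print(recent)
print(michigan_only)
```

Narrower subsets may be more homogeneous but leave fewer rows to learn from, so it is worth checking the resulting shapes before training.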


<h2> Prediction </h2>

ML Models [Resource](https://medium.com/@vijaya.beeravalli/comparison-of-machine-learning-classification-models-for-credit-card-default-data-c3cf805c9a5a)

In [None]:
from sklearn.linear_model import LinearRegression


In [None]:
model = LinearRegression()


In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
print(y_pred)


## Evaluation

Choose some metrics to evaluate the performance of your model. Some of them are mentioned in the tutorial.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score


# Evaluate model performance using Mean Squared Error (MSE) and R-squared (R2 Score)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2 Score): {r2}")
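To make the metrics above less opaque, the sketch below recomputes MSE and R-squared by hand with NumPy, along with RMSE and mean absolute error as two further options. The arrays `y_true` and `y_hat` are invented numbers, not output from the model in this notebook.

```python
import numpy as np

# Invented true scores and predictions, for illustration only
y_true = np.array([230.0, 240.0, 250.0])
y_hat = np.array([232.0, 238.0, 251.0])

residuals = y_true - y_hat

# Mean squared error: average of squared residuals
mse = np.mean(residuals ** 2)

# Root mean squared error: same units as the test score
rmse = np.sqrt(mse)

# Mean absolute error: average size of a miss, less sensitive to outliers
mae = np.mean(np.abs(residuals))

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

RMSE and MAE are often easier to interpret than MSE because they are expressed in score points rather than squared score points.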

We have copied over the graphs that visualize the model's performance on the training and testing set.

Change `col_name` and modify the call to `plt.ylabel()` to isolate how a single predictor affects the model.

In [None]:
import matplotlib.pyplot as plt

col_name = 'expenditure_per_student'

f = plt.figure(figsize=(12,6))
plt.scatter(X_train[col_name], y_train, color="red")
plt.scatter(X_train[col_name], model.predict(X_train), color="green")
plt.legend(['True Training', 'Predicted Training'])
plt.xlabel(col_name)
plt.ylabel('Grade 4 Math Score')
plt.title("Model Behavior On Training Set")
plt.show()


In [None]:
import matplotlib.pyplot as plt

f = plt.figure(figsize=(12,6))
plt.scatter(X_test[col_name], y_test, color="blue")
plt.scatter(X_test[col_name], model.predict(X_test), color="black")
plt.legend(['True Testing', 'Predicted Testing'])
plt.xlabel(col_name)
plt.ylabel('Grade 4 Math Score')
plt.title("Model Behavior On Testing Set")
plt.show()
