# Unit Test 2

Topics covered:

* Logistic Regression
* Resampling Methods
* Subset Selection
* Shrinkage Methods

## Background
The study of Near-Earth Objects (NEOs) is critical to protecting the planet from future asteroid impacts. Predicting potential impacts may seem straightforward, but the number of variables involved introduces considerable uncertainty. Because of this, the Center for NEO Studies (https://cneos.jpl.nasa.gov/about/cneos.html) and NASA's Jet Propulsion Laboratory have been logging data about asteroids and whether each one meets the classification of hazardous. The data is provided on Canvas.

### Dataset
`nasa.csv`

## Task
Your goal is to construct a model that effectively predicts whether an asteroid is hazardous. Ideally, you will find a relatively simple (i.e., interpretable) model, so that we don't need to collect every variable in the dataset to make a prediction (hint: subset selection and/or shrinkage would be useful for this). The fewer variables we have to collect, the more asteroids we'll be able to observe. I would also like you to use PCA to select some number of principal components and try a logistic regression with those as predictor variables.
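
The shrinkage hint could be approached with an L1-penalized (lasso-style) logistic regression, which can drive some coefficients exactly to zero. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than `nasa.csv`, and the names `X_shrink`, `y_shrink`, and `lasso_logit` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asteroid data (assumption: this is NOT nasa.csv).
X_shrink, y_shrink = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0,
    random_state=0,
)

# Standardize first so the penalty treats every predictor on the same scale.
# LogisticRegressionCV chooses the penalty strength C by 5-fold cross-validation.
lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        Cs=10, cv=5, penalty="l1", solver="liblinear", random_state=0
    ),
)
lasso_logit.fit(X_shrink, y_shrink)

# Predictors whose coefficient was not shrunk to zero are the ones "kept".
coefs = lasso_logit[-1].coef_.ravel()
kept = np.flatnonzero(coefs)
print(f"{kept.size} of {coefs.size} predictors kept:", kept)
```

How many predictors survive depends on the value of `C` that cross-validation selects, so the count should be read from the fitted model rather than assumed in advance.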

I'm looking for a well-constructed logistic regression, whose assumptions have been checked, with strong cross-validated accuracy and an interpretation of the coefficients. 
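
One assumption commonly checked for logistic regression is the absence of strong multicollinearity among the predictors. A minimal sketch of that check, using variance inflation factors (VIFs) on made-up data, is shown below. The variables `x1`, `x2`, and `x3` are hypothetical, and the often-quoted cutoff of 10 is a rule of thumb, not a requirement of this assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1, so collinear with it
X_vif = np.column_stack([x1, x2, x3])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on
# the other predictors. This equals the j-th diagonal entry of the inverse
# of the predictors' correlation matrix.
corr = np.corrcoef(X_vif, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

Here `x1` and `x3` should show large VIFs while `x2` stays near 1, which is the pattern that would suggest dropping or combining one of the collinear predictors.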

In [None]:
## Data Preparation
import pandas as pd
df = pd.read_csv('nasa.csv')
df.head()

In [None]:
## Exploratory Data Analysis (EDA)
print(df.describe())  # print explicitly; a notebook cell only displays its last expression
df.info()

In [None]:
## EDA continued
import seaborn as sns
import matplotlib.pyplot as plt

# numeric_only=True skips text columns, which raise an error in pandas >= 2.0
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True)
plt.show()

In [None]:
## Feature Engineering
df = pd.get_dummies(df, drop_first=True)

In [None]:
## Subset Selection
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

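X = df.drop(columns=['Hazardous'])
y = df['Hazardous']

# The default max_iter=100 may not be enough for the solver to converge
lr = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(lr, n_features_to_select=10, direction='forward')
sfs.fit(X, y)
# Index X.columns, not df.columns: the support mask has no entry for 'Hazardous'
selected_features = X.columns[sfs.get_support()]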

In [None]:
## Principal Component Analysis (PCA)
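from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first; otherwise the largest-scale variables dominate the components
X_scaled = StandardScaler().fit_transform(df.drop(columns=['Hazardous']))
pca = PCA(n_components=10)
principal_components = pca.fit_transform(X_scaled)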
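
The task also asks for a logistic regression on the principal components. One way this might be wired together is a pipeline, so that scaling and PCA are refit inside every cross-validation fold. The sketch below runs on synthetic data, not `nasa.csv`. The names `X_pca_demo`, `y_pca_demo`, and `pcr_logit`, and the 90% variance threshold, are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asteroid data (assumption: this is NOT nasa.csv).
X_pca_demo, y_pca_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

pcr_logit = make_pipeline(
    StandardScaler(),
    # A float n_components keeps enough components to explain that
    # fraction of the variance (here 90%).
    PCA(n_components=0.90, svd_solver="full"),
    LogisticRegression(max_iter=1000),
)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pca_scores = cross_val_score(
    pcr_logit, X_pca_demo, y_pca_demo, cv=cv, scoring="accuracy"
)
print(f"PCA + logistic regression CV accuracy: {pca_scores.mean():.3f}")

# Refit on all rows to see how many components the threshold retained.
pcr_logit.fit(X_pca_demo, y_pca_demo)
n_kept = pcr_logit.named_steps["pca"].n_components_
print(f"Components kept: {n_kept}")
```

Keeping the transformations inside the pipeline means the held-out fold never influences the scaler or the components, which makes the cross-validated accuracy a fairer estimate.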

In [None]:
## Logistic Regression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[selected_features], df['Hazardous'], test_size=0.2, random_state=42)

lr.fit(X_train, y_train)

In [None]:
## Model Evaluation
from sklearn.model_selection import cross_val_score

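cv_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='accuracy')
print(f'Cross-validated accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})')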

In [None]:
## Interpretation and Reporting
coeffs = pd.DataFrame(lr.coef_, columns=selected_features)
coeffs.T  # transpose so each selected feature gets its own row
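
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are usually easier to interpret. The following is a sketch on synthetic data with hypothetical feature names (`feature_0` to `feature_3`), not results from `nasa.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asteroid data (assumption: this is NOT nasa.csv).
X_odds_demo, y_odds_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    random_state=0,
)
names = [f"feature_{i}" for i in range(4)]  # hypothetical column names

# Standardizing makes each coefficient the change in log-odds per one
# standard deviation of that predictor, so magnitudes are comparable.
X_std = StandardScaler().fit_transform(X_odds_demo)
odds_model = LogisticRegression(max_iter=1000).fit(X_std, y_odds_demo)

odds = pd.DataFrame(
    {
        "coefficient": odds_model.coef_.ravel(),
        "odds_ratio": np.exp(odds_model.coef_.ravel()),
    },
    index=names,
).sort_values("odds_ratio", ascending=False)
print(odds)
```

An odds ratio above 1 means the odds of the positive class rise as the predictor increases, and a ratio below 1 means they fall. Note that scikit-learn's `LogisticRegression` applies an L2 penalty by default, so these coefficients are slightly shrunk compared with an unpenalized fit.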


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load and inspect data
df = pd.read_csv('nasa.csv')
print(df.head())

# Step 2: EDA
print(df.describe())
print(df.info())

# numeric_only=True skips text columns, which raise an error in pandas >= 2.0
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True)
plt.show()

# Step 3: Feature Engineering
df = pd.get_dummies(df, drop_first=True)

# Step 4: Subset Selection
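X = df.drop(columns=['Hazardous'])
lr = LogisticRegression(max_iter=1000)  # the default max_iter=100 may not converge
sfs = SequentialFeatureSelector(lr, n_features_to_select=10, direction='forward')
sfs.fit(X, df['Hazardous'])
# The support mask covers X's columns only, so index X.columns rather than df.columns
selected_features = X.columns[sfs.get_support()]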

# Step 5: PCA
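from sklearn.preprocessing import StandardScaler

# Standardize first; otherwise the largest-scale variables dominate the components
X_scaled = StandardScaler().fit_transform(df.drop(columns=['Hazardous']))
pca = PCA(n_components=10)
principal_components = pca.fit_transform(X_scaled)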

# Step 6: Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(df[selected_features], df['Hazardous'], test_size=0.2, random_state=42)
lr.fit(X_train, y_train)

# Step 7: Model Evaluation
cv_scores = cross_val_score(lr, X_train, y_train, cv=5)
print(f'Cross-validated accuracy: {cv_scores.mean()}')

# Step 8: Interpretation
coeffs = pd.DataFrame(lr.coef_, columns=selected_features)
print(coeffs)

# Step 9: Reporting
# Summarize findings in a well-constructed report
