# Deeper Dive into the Digital Divide

Flatiron School Capstone Project

## Overview

This project seeks to create a model to understand the impact of the digital divide by predicting student assessment performance outcomes in the United States.

### ReadMe Navigation

1. [Repository Navigation](#Repository-Navigation)

2. [Business Understanding](#Business-Understanding)

3. [Data Understanding](#Data-Understanding)

4. [Predictive Analysis](#Predictive-Analysis)
    1. Model Performance
    2. Prediction

5. [Conclusion](#Conclusion)
    1. Recommendations
    2. Areas for Growth

6. [Project Info](#Project-Info)

***

## Repository Navigation

- [DATA:](data)
    - *The root of this folder contains final prepped data*
    - [Raw Folder](data/raw)
        - Original ACS 5 Year Data
        - Original NCES Data
        
    - [Processed Folder](data/processed)
        
        
   
- [SRC:](src)
    - [Python Initialization file](src/__init__.py)
    - [Script for Data Acquisition](src/data_acquisition.py)
    - [Script for Data Cleaning](src/data_cleaning.py)
    - [Script for Visualizations](src/visualizations.py)
    - [Script for Modeling](src/modeling.py)
        
- [FIGURES:](figures)
    - Saved figures and visualizations used in notebooks/presentation
    
- [MODELS:](models)
    - Pickled files storing data relevant to the model creation.

- [NOTEBOOKS:](notebooks)
    - [Executive Notebook](notebooks/executive_notebook.ipynb)
    - [Data Acquisition](notebooks/acquisition.ipynb)
    - [Exploratory Data Analysis (EDA)](notebooks/EDA.ipynb)
    - [Visualizations](notebooks/visualizations.ipynb)
    - [Modelling and Evaluation](notebooks/models.ipynb)

- [PRESENTATION:](presentation)
    - [PDF](presentation/capstone_presentation.pdf)
    - [Powerpoint](presentation/capstone_presentation.pptx)

## Business Understanding

Across the country (and the world), as the 2020-21 academic year begins, the COVID-19 pandemic is forcing many schools to reopen only virtually. Given the existing inequalities in educational outcomes across the country, this deepening digital divide will set some students back more than others. This project seeks to create a model to predict the impact of that divide on students' performance. Using this model, stakeholders can understand how important it is to mitigate this solvable problem and predict which districts would be most impacted by a deepening digital divide.

The "understanding", therefore, is more of a "socio-political" understanding than a business understanding. Stakeholders in this project would include:
1. Nonprofits seeking to distribute devices
2. Local governments seeking to install broadband access

Specifically, individual districts could use this model to predict how much of an impact a certain percentage mitigation of the digital divide would have upon assessment scores.


**GOAL: Obtain the importance of the digital divide (broadband and device access) in accurately predicting students' educational assessment scores.**

***

## Data Understanding

The dataset is a concatenation of two datasets:

- National Center for Education Statistics (NCES) Assessment Scores
- United States Census American Community Survey 5-year Estimates
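
As a rough illustration of how two district-level sources like these can be combined, the sketch below joins them with pandas. The key `district_id`, the column names, and the toy values are all hypothetical; the actual NCES and ACS fields in this repository may differ, and the project may combine the files differently.

```python
import pandas as pd

# Hypothetical district-level frames; column names and values are illustrative only.
nces = pd.DataFrame({
    "district_id": ["0100005", "0100006", "0100007"],
    "math_score": [245.0, 251.5, 238.2],
})
acs = pd.DataFrame({
    "district_id": ["0100005", "0100006", "0100008"],
    "pct_broadband": [0.81, 0.92, 0.67],
    "pct_device": [0.88, 0.95, 0.74],
})

# Inner join keeps only districts present in both sources;
# validate raises if a district appears more than once on either side.
combined = nces.merge(acs, on="district_id", how="inner", validate="one_to_one")
print(combined)
```

An inner join is used here so that every remaining row has both an assessment score and census features; districts missing from either source are dropped.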


### Data Limitations

1. Assessment data limited
    - 2017-18
    - Testing not done in 2020
    
2. Census data limited
    - The datasets used were limited to what was available in the census
    - Additionally, this data was not filtered down to only households that have children of school age. So, for example, the measured digital divide includes single adults living without internet in a given district. That could be adjusted with further work.
    
### Dataset Features

```python
import pandas as pd

# Load the raw data and preview the first five rows
df = pd.read_csv('data/raw/zillow_data.csv')
df.head()
```

| Unnamed: 0 | RegionID | RegionName | City | State | Metro | CountyName | SizeRank | 1996-04 | 1996-05 | 1996-06 | ... | 2017-07 | 2017-08 | 2017-09 | 2017-10 | 2017-11 | 2017-12 | 2018-01 | 2018-02 | 2018-03 | 2018-04 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84654 | 60657 | Chicago | IL | Chicago | Cook | 1 | 334200.0 | 335400.0 | 336500.0 | ... | 1005500 | 1007500 | 1007800 | 1009600 | 1013300 | 1018700 | 1024400 | 1030700 | 1033800 | 1030600 |
| 1 | 90668 | 75070 | McKinney | TX | Dallas-Fort Worth | Collin | 2 | 235700.0 | 236900.0 | 236700.0 | ... | 308000 | 310000 | 312500 | 314100 | 315000 | 316600 | 318100 | 319600 | 321100 | 321800 |
| 2 | 91982 | 77494 | Katy | TX | Houston | Harris | 3 | 210400.0 | 212200.0 | 212200.0 | ... | 321000 | 320600 | 320200 | 320400 | 320800 | 321200 | 321200 | 323000 | 326900 | 329900 |
| 3 | 84616 | 60614 | Chicago | IL | Chicago | Cook | 4 | 498100.0 | 500900.0 | 503100.0 | ... | 1289800 | 1287700 | 1287400 | 1291500 | 1296600 | 1299000 | 1302700 | 1306400 | 1308500 | 1307000 |
| 4 | 93144 | 79936 | El Paso | TX | El Paso | El Paso | 5 | 77300.0 | 77300.0 | 77300.0 | ... | 119100 | 119400 | 120000 | 120300 | 120300 | 120300 | 120300 | 120500 | 121000 | 121500 |


**Data Analysis**
- Given the monthly price measurements, we used **Time Series Analysis** to gain a better understanding of the patterns in the data that would impact our predictions.

- Additionally, we restrict our model construction to data ***after the 2008 recession,*** since the variation due to the anomaly is already known, and our model will perform better without having to account for such a huge fluctuation.

- Finally, given computational resource restrictions, we ***limit our dataset*** to the 60 zip codes with the largest ROI in a 5-year rolling window.
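
The filtering steps above might look roughly like the sketch below on a wide, one-column-per-month table. This is a sketch under stated assumptions, not the project's actual code: the `2010-01` cutoff, the use of only the most recent 5-year window (rather than a rolling one), the name `roi_5yr`, and the randomly generated prices are all illustrative.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame: one row per zip code, one column per month (random values).
months = pd.period_range("2005-01", "2018-04", freq="M").astype(str)
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    rng.uniform(100_000, 500_000, size=(100, len(months))),
    columns=months,
)
prices.insert(0, "RegionName", [str(10000 + i) for i in range(100)])

# Keep only post-recession months; the 2010-01 cutoff is an assumption.
# YYYY-MM strings sort chronologically, so a string comparison is enough.
month_cols = [c for c in prices.columns if c != "RegionName" and c >= "2010-01"]
post = prices[["RegionName"] + month_cols]

# 5-year ROI over the most recent window:
# (last price - price 60 months earlier) / price 60 months earlier.
start, end = post[month_cols[-61]], post[month_cols[-1]]
roi = ((end - start) / start).rename("roi_5yr")

# Select the 60 zip codes with the largest ROI.
top60 = post.loc[roi.nlargest(60).index]
print(top60.shape)
```

Restricting to a fixed subset this way keeps per-zip-code model fitting tractable, at the cost of ignoring regions outside the selected set.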

## Models - Predictive Analysis
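
A minimal sketch of how the three model families listed in this section could be fit and compared with scikit-learn, and how feature importances could be read from the random forest. The data is synthetic, and the feature names (`pct_broadband`, `pct_device`, `median_income`), hyperparameters, and train/test split are assumptions for illustration, not the settings used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; feature names and the target relationship are made up.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "pct_broadband": rng.uniform(0.4, 1.0, 500),
    "pct_device": rng.uniform(0.5, 1.0, 500),
    "median_income": rng.uniform(20_000, 120_000, 500),
})
y = 200 + 40 * X["pct_broadband"] + 20 * X["pct_device"] + rng.normal(0, 5, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model and record R^2 on the held-out split.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

# Impurity-based importances from the forest, one value per feature (they sum to 1).
importances = pd.Series(
    models["random_forest"].feature_importances_, index=X.columns
)
print(scores)
print(importances.sort_values(ascending=False))
```

Impurity-based importances are only one way to rank features; they can favor high-cardinality features, so permutation importance is a common cross-check.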

### Linear Regression

### Decision Tree Regressor

### Random Forest Regressor


## Model Performance Evaluation

### Performance Metric


### Model Selection




## Conclusions



### Recommendations:




### Areas for Growth:

#### Include Adjacent Data Resources
With additional time, we would incorporate other ways to measure "top 5" and ROI, including other sources such as:
- Zillow Rent Data
- Competitor Median Prices
- AirBnB rental increases/returns

#### Improve Model
The model was significantly limited due to time and resource constraints. With more of each, we could:
- Fit model to all zip codes, not just limited set
- Consider grouping zip codes and areas into better tiers
- Train model on nationwide data for the ability to "drill-down" more precisely

#### Extend Time Frame for Analysis
With more data, we could predict longer term trends instead of just the limited 5-year period we selected.


## Project Info

Contributors: __[Alexander](https://www.linkedin.com/in/anewt/)__ __[Newton](https://github.com/anewt225)__

Languages  : Python

Tools/IDE  : Git, Command Line (Windows), Anaconda, Jupyter Notebook / Jupyter Lab, Google Slides

Libraries  : numpy, pandas, matplotlib, seaborn, scikit-learn, statsmodels 

Duration   : August 2020

Last Update: 08.31.2020


