
# Capstone Milestone 1 - Topic Submissions and Data Collection

The purpose of Capstone Milestone 1 is to submit at least two capstone project ideas along with data sources for each project. Deliverables are:

* A preliminary problem definition
* Information about the relevant data
* Data sources and data type
* Data loaded into Pandas DataFrame
* An overview of the data - rows, columns, number of nulls, etc.
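
The last deliverable (rows, columns, and null counts) can be produced with a small helper. This is only a sketch: the `overview` function and the `demo` frame are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, non-null count, null count, null share."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "nulls": df.isna().sum(),
    })
    # Share of missing values per column, in percent.
    summary["null_pct"] = (summary["nulls"] / len(df) * 100).round(1)
    return summary


# Tiny made-up frame, used only to show the shape of the result.
demo = pd.DataFrame({
    "county": ["Bernalillo", "Santa Fe", None],
    "enrollment": [1200.0, None, 300.0],
})
print(demo.shape)      # (rows, columns)
print(overview(demo))
```

Calling `overview(some_df)` on any loaded DataFrame gives one row per column, which is easier to scan than the output of `info()` when several sources need to be compared.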
 

# Imports

In [None]:
# grab the imports needed for the project
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm

# General scikit-learn utilities
from sklearn import datasets
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import classification_report
import sklearn.model_selection as model_selection

# Naive Bayes classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

# Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error




In [None]:
# Install dbfread, used to read the census tract .dbf file
# https://dbfread.readthedocs.io/en/latest/
!pip3 install dbfread

Collecting dbfread
  Downloading dbfread-2.0.7-py2.py3-none-any.whl (20 kB)
Installing collected packages: dbfread
Successfully installed dbfread-2.0.7


In [None]:
# dbf file import
from dbfread import DBF
from pandas import DataFrame

In [None]:
# Mount Drive
from google.colab import drive
drive.mount('/drive')

Mounted at /drive


# Idea 1 - Post Secondary Enrollment Trends

I have seen a downward trend in college education during my lifetime. College was very important in the 1980s and 1990s and was a requirement for many jobs. Without Internet resources, college was almost the only means of obtaining higher-level education. With the Internet, YouTube™, and many online learning resources, information and education outside college are plentiful. As a result, college requirements for employment are declining. More employers are accepting people based on experience or on certifications earned through self-study or formal channels. Education is still important; what is changing is the mechanism for obtaining it.

**Problem Definition**

What factors determine post secondary education choice? **Supervised Classification**

Scope: county or census tract level, measured as the percentage of people who choose each route.

**Classifications**

* Traditional University or College
* Trade school
* Online learning - proctored
* Online learning - self-paced
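
Given person-level records labeled with one of these routes, the percentage choosing each route per county could be computed with a row-normalized cross-tabulation. The sketch below uses invented records; the column names `county` and `route` and the label strings are assumptions, not fields from any of the data sources.

```python
import pandas as pd

# Hypothetical person-level records; values are invented for illustration.
records = pd.DataFrame({
    "county": ["Bernalillo", "Bernalillo", "Bernalillo", "Bernalillo",
               "Santa Fe", "Santa Fe"],
    "route": ["university", "trade_school", "university", "online_self_paced",
              "online_proctored", "university"],
})

# normalize="index" makes each county row sum to 1; scale to percent.
route_share = pd.crosstab(records["county"], records["route"], normalize="index") * 100
print(route_share.round(1))
```

The same call works at tract level by swapping the grouping column for a tract identifier.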

**Data Requirements**

* Demographic data (gender, race, age, income, location, household size, citizenship)
* Learning costs (tuition, board, fees, travel)
  * Scholarships and loans available
* Learning entry requirements (test scores, interview requirements)
* Learning choices (college/university, trade school, Udemy, Coursera)
* Learning characteristics (in person, remote, self-paced, proctored, certificate, hours)
* Job availability by type
  * Education requirements for jobs
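
Once features like those above are assembled, the supervised classification step might look roughly like the following. This is a sketch on synthetic data, not the project's model: the two numeric features, the four integer labels, and the choice of `GaussianNB` (one of the classifiers already imported in this notebook) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in data: two invented numeric features and four route
# labels (0..3), one cluster of points per label.
n_per_class = 50
centers = np.array([[0, 0], [3, 0], [0, 3], [3, 3]], dtype=float)
X = np.vstack([rng.normal(c, 1.0, size=(n_per_class, 2)) for c in centers])
y = np.repeat(np.arange(4), n_per_class)

# Hold out 25% of the rows, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

Real demographic and cost features would need encoding and scaling decisions first, and a different classifier may suit categorical inputs better.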

## Data Collection - Idea 1

[NM Census data](https://www.kaggle.com/muonneutrino/us-census-demographic-data?select=acs2017_census_tract_data.csv)

[NM Geo Census data - 2020](https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2020&layergroup=Census+Tracts)

[US College Enrollment data](https://educationdata.org/college-enrollment-statistics)

*Per educationdata.org, for nearly every state, around 90% of students are from outside the state.* 

*In NM, there has been a 24% decline in enrollment since 2010 while other states have declined less or have increased. Why?*
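
To compare states on the same footing, the decline can be expressed as a percent change between two years. The helper below is a hypothetical illustration, and the enrollment counts are made up so that the result equals a 24% decline.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means decline)."""
    if old == 0:
        raise ValueError("percent change is undefined when the starting value is 0")
    return (new - old) * 100 / old


# Made-up enrollment counts chosen to reproduce a 24% decline.
print(percent_change(100_000, 76_000))  # -24.0
```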

[College spending by state](https://educationdata.org/public-education-spending-statistics)

*Note: NM is 40th in spending per student. Other states with known education issues rank much higher - such as NY, MD, PA. The question is "Does more money equate to a better education?"*

[Percent of recent HS grads enrolled in college](https://nces.ed.gov/programs/digest/d18/tables/dt18_302.20.asp)

[OECD.stat Organization for Economic Cooperation and Development](https://stats.oecd.org/Index.aspx)

*Note: Most recent data is from 2012. Needs further investigation.*

[BEA.gov US Bureau of Economic Analysis](https://www.bea.gov/data/by-place-us)

[UNESCO Institute for Statistics](http://data.uis.unesco.org/)

[US Department of Education Budget Tables](https://www2.ed.gov/about/overview/budget/history/index.html)

[National Science Foundation - Science and Engineering Indicators](https://ncses.nsf.gov/indicators/states/indicator/state-student-aid-expenditures-per-full-time-undergraduate-student)

## Data Overview - Idea 1

**Data Availability**

* Many education statistics are available, making this a practical project topic.
* Most data is in Excel format, so it can be easily imported and converted to a DataFrame.
* Data will come from multiple sources, so it will need to be merged.
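
Merging several sources tends to surface mismatched keys, so it may help to make the merge report them. The sketch below uses two small in-memory frames as stand-ins for spreadsheets that would normally be loaded with `pd.read_excel`; the column names and values are placeholders, not real figures.

```python
import pandas as pd

# Stand-ins for two sources; names and numbers are hypothetical placeholders.
enrollment = pd.DataFrame({
    "state": ["NM", "AZ", "CO"],
    "enrollment": [1, 2, 3],
})
spending = pd.DataFrame({
    "state": ["NM", "AZ", "TX"],
    "spend_per_student": [10.0, 20.0, 30.0],
})

merged = enrollment.merge(
    spending,
    on="state",
    how="outer",
    validate="one_to_one",  # fail loudly if a key is duplicated in either source
    indicator=True,         # adds a _merge column: both / left_only / right_only
)
print(merged)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` show which states are missing from one source before any analysis depends on them.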

**Tasks**

* Refine problem definition so I know what indicators will be needed for project analysis.
* Specifically identify data sources that provide needed indicators.
* Import, merge, transform, clean, review data sources in more detail to prepare for project analysis.




In [None]:
# Census tract geo data (2020 TIGER/Line tract file for New Mexico, state FIPS 35)
census_tract_path = "/drive/MyDrive/Student Folder - Cecilia/Projects/Capstone/Data/tl_2020_35_tract/tl_2020_35_tract.dbf"
# DBF yields one dict-like record per row, which DataFrame can consume directly
census_tract_dbf = DBF(census_tract_path)
census_tract_df = DataFrame(census_tract_dbf)
census_tract_df.head()
# census_tract_df.info()

| | STATEFP | COUNTYFP | TRACTCE | GEOID | NAME | NAMELSAD | MTFCC | FUNCSTAT | ALAND | AWATER | INTPTLAT | INTPTLON |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35 | 13 | 1207 | 35013001207 | 12.07 | Census Tract 12.07 | G5020 | S | 139531826 | 142378 | 32.3771603 | -106.5763162 |
| 1 | 35 | 13 | 1314 | 35013001314 | 13.14 | Census Tract 13.14 | G5020 | S | 246796932 | 0 | 32.5239438 | -106.6786859 |
| 2 | 35 | 13 | 1310 | 35013001310 | 13.1 | Census Tract 13.10 | G5020 | S | 4789927 | 0 | 32.3798068 | -106.7437402 |
| 3 | 35 | 13 | 1316 | 35013001316 | 13.16 | Census Tract 13.16 | G5020 | S | 4992654 | 0 | 32.393647 | -106.7153743 |
| 4 | 35 | 13 | 1208 | 35013001208 | 12.08 | Census Tract 12.08 | G5020 | S | 21632784 | 56931 | 32.3599472 | -106.7188392 |
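
To join this tract table to tract-level demographic data, both sides need the same 11-character GEOID key. The sketch below rebuilds the key from its parts with zero-padding, since the preview above shows the component codes without leading zeros. The demographic frame is a hypothetical stand-in: the `TractId` and `TotalPop` column names and all of its values are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the tract table, with codes stored without leading zeros
# to mimic the preview above.
tracts = pd.DataFrame({
    "STATEFP": [35, 35],
    "COUNTYFP": [13, 13],
    "TRACTCE": [1207, 1314],
})

# A tract GEOID is 2 state digits + 3 county digits + 6 tract digits.
tracts["GEOID"] = (
    tracts["STATEFP"].astype(str).str.zfill(2)
    + tracts["COUNTYFP"].astype(str).str.zfill(3)
    + tracts["TRACTCE"].astype(str).str.zfill(6)
)

# Hypothetical demographic table keyed by an integer tract id.
acs = pd.DataFrame({"TractId": [35013001207, 35013009999], "TotalPop": [1, 2]})
acs["GEOID"] = acs["TractId"].astype(str).str.zfill(11)

# A left join keeps every tract; _merge flags tracts with no demographic match.
joined = tracts.merge(acs, on="GEOID", how="left", indicator=True)
print(joined[["GEOID", "TotalPop", "_merge"]])
```

Because the tract file is the 2020 vintage while the linked demographic data is from 2017, some tracts may not match; the `_merge` column makes those easy to count.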


# Idea 2 - Consumer Transportation Choices - SCRAP

Apparently, this is a pet project of mine: how to get people to take the Albuquerque city bus. My mom reminded me that I had a college project centered on this same topic. The purpose of this project would be to understand what factors contribute to people's transportation choices so that we could ultimately increase ride-sharing options, reducing the number of individual cars on the road and, in turn, emissions and traffic congestion.

**Problem Definition**

What factors determine transportation choice? **Supervised Classification**

**Classifications**

* Bus
* Rideshare (Uber, Lyft, cab, van)
* Rental (Zip, other)
* Physical (walk, bicycle, skateboard, scooter)
* Personal vehicle 

**Data Requirements**

* Demographic data (gender, race, age, income, location, household size, citizenship)
* Transportation costs (purchase, rent, fare, insurance, fuel, repairs)
* Distance traveled (commuting, running errands)
* Usage frequency



## Data Collection - Idea 2

[Bus routes and stops for ABQ RIDE](https://opendata.cabq.gov/dataset/bus-routes-and-stops-for-abq-ride)

*Note: the data contains links to other useful data*

[NM State Data Center Program](https://gonm.biz/site-selection/state-data-center-program)

*Note: contains modes of transportation for each county by gender*

## Data Overview - Idea 2

**Data Availability**

Data seems limited. I could find data on bus routes in Albuquerque and on modes of transportation. However, some factors might contribute to a lack of data availability.

* Are all groups represented in the data? For example, are homeless people represented? Are people who pay with cash represented? Are people who work remotely considered?

* I could not find available data on Lyft, Uber, Taxi, rideshare, etc.

* How would we get data on people who ride bicycles, skateboards, scooters, walk?

# Project Conclusions - Before Project Completion

The Internet has changed education and transportation. With the Internet, information is readily available. People can perform many tasks from the convenience of their home: work at home, shop at home, learn at home. The Internet has also changed our pace of life. People expect things to happen quickly. People do not want to spend several years and thousands of dollars on a college education if they can get a good job without making those sacrifices. People do not want to wait for public transportation if there are more convenient, affordable options available. People want services tailored to their lives, not the other way around.

It seems that people are concerned with social media and social standing. So perhaps the way to increase college enrollment and public transportation usage is to market the social aspects of each. Every choice involves cost vs. benefit. We have to create perceived value for people to reconsider college and public transportation. 