
etl-project

Extract, transform, and load! Looking for socioeconomic patterns in police shootings occurring in the U.S.

Project Proposal

Imagine a scenario where a batch of data comes in and you and your team are tasked with migrating it to a production database.

Finding Data

The team decided to tackle data about current U.S. events regarding police shootings. The team found a useful dataset on Kaggle (in the form of static CSVs). The team acknowledges that static data ages, and ideally the data would come from an API to stay more current.

Data Source: https://www.kaggle.com/kwullum/fatal-police-shootings-in-the-us/

Data Sets:

  1. PoliceKillingsUS.csv
  2. ShareRaceByCity.csv
  3. PercentOver25CompletedHighSchool.csv
  4. PercentagePeopleBelowPovertyLevel.csv
  5. MedianHouseholdIncome2015.csv
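
The five CSVs above were pulled into pandas DataFrames as the extraction step. A minimal sketch, assuming the files live in a local Resources/ folder and that some of the census files are not UTF-8 encoded (the folder name and the cp1252 encoding are assumptions, not taken from the project):

```python
import pandas as pd

# Some of the census CSVs in this Kaggle dataset are not UTF-8 encoded,
# so an explicit encoding is passed; adjust if your copy differs.
ENCODING = "cp1252"

killings_df  = pd.read_csv("Resources/PoliceKillingsUS.csv", encoding=ENCODING)
race_df      = pd.read_csv("Resources/ShareRaceByCity.csv", encoding=ENCODING)
education_df = pd.read_csv("Resources/PercentOver25CompletedHighSchool.csv", encoding=ENCODING)
poverty_df   = pd.read_csv("Resources/PercentagePeopleBelowPovertyLevel.csv", encoding=ENCODING)
income_df    = pd.read_csv("Resources/MedianHouseholdIncome2015.csv", encoding=ENCODING)
```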

Data Cleanup & Analysis

All of the datasets collected share geolocation data: city and state. However, city names were formatted differently across files, e.g., Atlanta City in ShareRaceByCity.csv versus Atlanta City city in PoliceKillingsUS.csv.

The first step was to assign a unique city ID to each specific geolocation (city, city type, state combination). This was necessary because states aren't very creative with city names: the same name shows up repeatedly across the country. The resulting city DataFrame (with subsequent merges) was used to assign IDs to all datasets. Formatting and extraction were then performed with pandas to clean up individual city entries, using string methods such as split and replace to isolate exactly the words we needed (see the sketch below).
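
A sketch of that cleanup and ID assignment, shown on toy data; the column names, the suffix-stripping rules, and the simplification of the key to (city, state) are illustrative assumptions:

```python
import pandas as pd

def clean_city(name: str) -> str:
    """Normalize a raw city entry, e.g. 'Atlanta City city' -> 'Atlanta City'."""
    name = name.strip()
    # Strip a trailing Census-style descriptor such as 'city', 'town', or 'CDP'.
    for suffix in (" city", " town", " CDP"):
        if name.endswith(suffix):
            name = name[: -len(suffix)].strip()
    return name

# Two stand-ins for the real datasets, which disagree on city formatting.
race_df = pd.DataFrame({"city": ["Atlanta City", "Abbeville city"], "state": ["GA", "AL"]})
killings_df = pd.DataFrame({"city": ["Atlanta City city", "Abbeville"], "state": ["GA", "AL"]})

for df in (race_df, killings_df):
    df["city"] = df["city"].apply(clean_city)

# Build a master city table with one unique city_id per (city, state) pair ...
city_df = (
    pd.concat([race_df, killings_df], ignore_index=True)
    .drop_duplicates(subset=["city", "state"])
    .reset_index(drop=True)
    .rename_axis("city_id")
    .reset_index()
)

# ... then merge it back onto each dataset to attach its city_id.
race_df = race_df.merge(city_df, on=["city", "state"], how="left")
killings_df = killings_df.merge(city_df, on=["city", "state"], how="left")
```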

We decided to use SQL over MongoDB because the data, despite the cleaning it required, is well structured. Additionally, after using QuickDBD to visualize the schema, we saw that a relational model effectively bonds the five tables through their shared city attributes. We chose a city_db table as the main connector to all other tables. Since the datasets don't vary in structure, Postgres was our SQL database of choice.
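
As a rough illustration of that relational model, here is how the central city table and one child table might look as SQLAlchemy (1.4+) models; the table and column names are assumptions, and the project's actual schema was drawn in QuickDBD:

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class City(Base):
    """Central table: one row per (city, city type, state) combination."""
    __tablename__ = "city_db"
    city_id = Column(Integer, primary_key=True)
    city = Column(String, nullable=False)
    city_type = Column(String)
    state = Column(String, nullable=False)

class MedianIncome(Base):
    """Child table linked back to city_db through city_id."""
    __tablename__ = "median_income"
    id = Column(Integer, primary_key=True)
    city_id = Column(Integer, ForeignKey("city_db.city_id"))
    median_income = Column(Float)

# The race, education, poverty, and killings tables follow the same pattern:
# their own measure columns plus a city_id foreign key into city_db.
```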

The reason for keeping five tables, instead of concatenating or merging them into two or three, is that a close review of each table showed that not every city_id appears in every table; merging would therefore either introduce NAs or drop rows. We wanted to keep all of the data available: when new city data arrives, we can populate income data even when education data may not be available.
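
A quick toy example of that trade-off: an inner merge silently drops any city missing from either table, while an outer merge keeps every city at the cost of NA values.

```python
import pandas as pd

income = pd.DataFrame({"city_id": [1, 2, 3], "median_income": [51000, 43000, 62000]})
education = pd.DataFrame({"city_id": [1, 3], "pct_completed_hs": [88.2, 91.5]})

inner = income.merge(education, on="city_id", how="inner")  # city_id 2 is lost
outer = income.merge(education, on="city_id", how="outer")  # city_id 2 kept, NA for education

print(inner)
print(outer)
```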

Once all of the DataFrames were in the shape we expected, we did the following to start the analysis (sketched after the list):

  1. Created the DB manually on PostgreSQL
  2. Created the connection to PostgreSQL via SQLAlchemy
  3. Created each of the 5 tables
  4. Confirmed table contents both directly in the PostgreSQL client and via SQLAlchemy, by querying the tables.
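
A minimal sketch of steps 2-4, assuming the tables are written with pandas' to_sql; the connection string, database name, and credentials are placeholders, not the project's actual configuration:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Step 2: connect to the database created manually in PostgreSQL (step 1).
engine = create_engine("postgresql://postgres:<password>@localhost:5432/etl_db")

# Step 3: write each cleaned DataFrame to its own table.
city_df = pd.DataFrame({"city_id": [0, 1],
                        "city": ["Atlanta City", "Abbeville"],
                        "state": ["GA", "AL"]})
city_df.to_sql("city_db", engine, if_exists="append", index=False)
# ... repeat for the income, race, education, poverty, and killings DataFrames.

# Step 4: confirm the load by querying the table back through SQLAlchemy.
with engine.connect() as conn:
    for row in conn.execute(text("SELECT * FROM city_db LIMIT 5")):
        print(row)
```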

(Note: the team recognizes the access limitation imposed by hosting the data on a local Postgres server. A better approach would be to host the database in the cloud with an online service such as Amazon Web Services (AWS).)

Regards, Team F
