# 1: Origins of Data Science

## History 
The term "Data Science" has been in use since the 1970s and is usually attributed to Prof. Peter Naur of the University of Copenhagen, who used it in 1974. In those days, Data Science mostly referred to data processing methods, which grew more sophisticated over time. Prof. Jeff Wu of the University of Michigan added a statistical dimension in 1997 with a lecture famously titled "Statistics = Data Science?", which he also delivered at the Indian Statistical Institute. Data Science has evolved since then, adding further dimensions such as Machine Learning, Applied Mathematics and other Engineering methods.

#### Definition
Wikipedia defines Data Science as follows: "Data Science employs techniques and theories drawn from many fields with the broad areas of mathematics, statistics, operations research, information science and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, database, data engineering, pattern recognition and learning, visualization, predictive analysis, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high performance computing." The breadth of this definition reflects the variety of fields within academia and industry that have shaped Data Science. The area is still evolving, with Physics, Mathematics, Statistics and Engineering contributing a plethora of methods, all aimed at solving complex problems.

### What is Data Science? 
Data Science refers to solving complex problems by understanding data. This involves many steps, chief among them Exploratory Data Analysis (EDA), Modeling and Algorithms, followed by results and Data Visualization.

### Who is a Data Scientist?
A Data Scientist is an individual who uses data and statistical ability to discover, explore and interpret the given data, builds mathematical models that fit the data, and then produces useful results. Harvard Business Review called it "The Sexiest Job of the 21st Century".

### Technology Stack
The Data Scientist technology stack varies across organizations depending on need. Primarily, Data Scientists use Python and R, along with Python libraries such as Pandas and Scikit-Learn. Some Data Scientists may also be expected to work on large data sets ranging from hundreds of GB to several TB, though typically Data Engineers and Computer Programmers take on the work of scaling these algorithms.

### Course Prerequisites
* Basic knowledge of Python
* Mathematics, including Calculus, Linear Algebra, Probability and Statistics

### Instructions
One of the key factors known to have sparked interest in Data Science is the famous "Netflix Prize", a competition that offered $1 million to the team or individual who could build the best recommendation engine, one that predicts a user's ratings for movies based only on previous ratings.

Suppose a user rates their interest in each movie category on a scale of 0 to 10:

User Interest by Movie Category

| Feel Good | Horror | Comedy | Adult | Tragedy |
|-----------|--------|--------|-------|---------|
| 10.0      | 0.0    | 5.0    | 1.0   | 2.0     |

How do we tell whether a movie with the components shown below aligns with the user? Let us use a distance metric to measure how closely the movie matches the user's interests.
 
## 1. Exercise

Movie Category

| Feel Good | Horror | Comedy | Adult | Tragedy |
|-----------|--------|--------|-------|---------|
| 7.0       | 2.0    | 9.0    | 7.0   | 2.0     |

This is a sample lesson that illustrates how to work with the console. You can edit the code as per the instructions, assign results to variables and then click Run to execute it.

Click Run to compute how closely the movie matches the user's preferences.

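```python
from scipy.spatial import distance

# Ratings per category: Feel Good, Horror, Comedy, Adult, Tragedy (each 0-10)
user_pref = [10.0, 0.0, 5.0, 1.0, 2.0]
movie_features = [7.0, 2.0, 9.0, 7.0, 2.0]

# Euclidean distance measures dissimilarity: 0 means identical ratings,
# and larger values mean the movie is further from the user's preferences.
user_match = distance.euclidean(user_pref, movie_features)
print("Euclidean distance between user preferences and movie features:", user_match)
```

```
Euclidean distance between user preferences and movie features: 8.06225774829855
```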
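
The Euclidean distance computed in this exercise is a measure of dissimilarity, so a smaller value means a closer match. The sketch below spells out the arithmetic behind the result and then rescales it to a 0 to 10 match score. The rescaling (dividing by the largest possible distance for five ratings in the range 0 to 10) is an assumption made for illustration and is not part of the original lesson.

```python
import math

user_pref = [10.0, 0.0, 5.0, 1.0, 2.0]
movie_features = [7.0, 2.0, 9.0, 7.0, 2.0]

# Euclidean distance written out: square each difference, sum, take the root.
squared_diffs = [(u - m) ** 2 for u, m in zip(user_pref, movie_features)]
dist = math.sqrt(sum(squared_diffs))

# Largest possible distance when each of the 5 ratings lies in [0, 10].
max_dist = math.sqrt(len(user_pref) * 10.0 ** 2)

# Assumed rescaling: 10 = identical ratings, 0 = as far apart as possible.
match_score = 10.0 * (1.0 - dist / max_dist)

print(squared_diffs)          # [9.0, 4.0, 16.0, 36.0, 0.0]
print(round(dist, 4))         # 8.0623
print(round(match_score, 2))  # 6.39
```

Under this assumed rescaling, the movie scores about 6.4 out of 10, mostly because of the large gap in the Adult category.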


# 2: Data Science Workflow

## Pipeline
A Data Science workflow or a pipeline refers to the standard activities that a Data Scientist performs from acquiring data to communicating the final results. 

Here are the important steps in the Data Science pipeline:

1. Data Acquisition
2. Exploratory Data Analysis (EDA)
3. Problem Identification
4. Modeling
5. Model Validation and Fine Tuning
6. Communicating Final Results
7. Scaling and Big Data

## 1. Data Acquisition
Acquiring data is the first step in the pipeline. This involves working with Data Engineers or Infrastructure Engineers to obtain data in a structured format such as JSON, CSV or text. Data Engineers are expected to provide the data to Data Scientists in a known format, which involves parsing the data and pushing it to a SQL database or another format that is easy to work with. This can mean applying a schema that is already known or that can be inferred from the original data. When the original data is unstructured, it needs to be cleaned and the relevant data extracted from it, for example with a regular expression parser, Perl and Unix scripts, or a language of your choice.
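
As a minimal sketch of this step, the example below uses a regular expression to extract fields from unstructured text lines and pushes the cleaned records into a SQL table with a known schema. The raw line format, the field names and the `events` table are all hypothetical and chosen only for illustration. An in-memory SQLite database stands in for whatever database a team actually uses.

```python
import re
import sqlite3

# Hypothetical raw lines; the format is an assumption made for illustration.
raw_lines = [
    "2020-01-15 user=alice action=login",
    "corrupted line without any fields",
    "2020-01-16 user=bob action=purchase",
]

# Named groups double as the column names of the target schema.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) user=(?P<username>\w+) action=(?P<action>\w+)$"
)


def parse_lines(lines):
    """Return one dict per line that matches the pattern; skip the rest."""
    records = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


records = parse_lines(raw_lines)

# Push the structured records into a SQL table (in-memory for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date TEXT, username TEXT, action TEXT)")
conn.executemany("INSERT INTO events VALUES (:date, :username, :action)", records)

rows = conn.execute("SELECT username, action FROM events ORDER BY date").fetchall()
print(rows)  # [('alice', 'login'), ('bob', 'purchase')]
```

Lines that do not match the pattern are dropped silently here. In practice it is usually worth counting or logging them so that data loss during cleaning stays visible.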

## 2. Exercise

#### Instructions
* Given the CSV file about Women in STEM, read the contents of the file line by line and push each line to a list variable, stem_women.
* Print out the first 5 elements of the list.

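```python
import requests

# Download the CSV file
url = "https://raw.githubusercontent.com/Colaberry/538data/master/college-majors/women-stem.csv"
r = requests.get(url)

# Iterator over the lines of the response body
text = r.iter_lines()

# List to be filled with the lines of the file
stem_women = []
```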

## 2. Solution
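
One possible solution is sketched below. It assumes that `r.iter_lines()` yields each line as bytes, which is the default behavior in Requests, so each line is decoded before being stored. To keep the sketch runnable without a network connection, a small hypothetical list of byte strings stands in for the downloaded file. The helper name `collect_lines` is also made up for this example.

```python
def collect_lines(raw_lines):
    """Decode each non-empty byte line and collect the results in a list."""
    return [line.decode("utf-8") for line in raw_lines if line]


# Hypothetical stand-in for r.iter_lines(); these are NOT the real file contents.
sample_lines = [b"header_a,header_b", b"1,2", b"", b"3,4"]

stem_women = collect_lines(sample_lines)
print(stem_women[:5])  # ['header_a,header_b', '1,2', '3,4']
```

In the exercise itself, pass `r.iter_lines()` in place of `sample_lines`.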

# 2: Exploratory Data Analysis

## 2. Exploratory Data Analysis (EDA)
Exploratory Data Analysis is the second step, which involves looking at statistics and visualizations generated from various dimensions of the dataset. This helps in identifying anomalies and errors, and in spotting where Machine Learning could be applied. For example, suppose a dataset has a column full of social security numbers. Just computing the list of unique values whose frequency counts exceed a threshold such as 10 could reveal spurious numbers such as 000-00-0000. Such a value is clearly wrong and could cause problems when applying machine learning techniques. Graphs offer another way to identify such spurious data. Hence, this is a crucial step in the Data Science pipeline.
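
The frequency check described above can be sketched in a few lines. The values below are made up for illustration and are not real social security numbers. The threshold of 10 follows the example in the text.

```python
from collections import Counter

# Made-up column values for illustration only.
ssn_column = ["000-00-0000"] * 12 + ["123-45-6789", "987-65-4321", "123-45-6789"]

THRESHOLD = 10

# Count how often each unique value occurs, then keep the frequent ones.
counts = Counter(ssn_column)
suspicious = {value: n for value, n in counts.items() if n > THRESHOLD}

print(suspicious)  # {'000-00-0000': 12}
```

A genuine identifier should rarely repeat this often, so any value that survives the filter is a candidate for a placeholder or data-entry error and deserves a closer look.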

## 3. Problem Identification
After EDA, we can identify whether any Machine Learning techniques, such as prediction or clustering, can be applied to the dataset.

## 4. Modeling
Modeling refers to applying Machine Learning models that follow basic principles to the dataset. We shall study these models in future lessons.

## 5. Model Validation and Fine Tuning
The models built so far need to be validated for performance. Afterwards, parameter tuning is most often performed to improve the model's performance on the test dataset.
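
A minimal sketch of validation followed by parameter tuning is shown below, using only NumPy. Everything in it is an assumption made for illustration: the data is synthetic, the "model" is a polynomial fit, and the parameter being tuned is the polynomial degree. A separate final test set is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (made up for illustration): a noisy quadratic.
x = rng.uniform(-3.0, 3.0, size=200)
y = x**2 - x + 2.0 + rng.normal(0.0, 1.0, size=200)

# Hold out part of the data so the model is scored on points it never saw.
indices = rng.permutation(len(x))
train_idx, val_idx = indices[:150], indices[150:]


def validation_mse(degree):
    """Fit a polynomial of the given degree on the training split and
    return its mean squared error on the held-out split."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    predictions = np.polyval(coeffs, x[val_idx])
    return float(np.mean((predictions - y[val_idx]) ** 2))


# Parameter tuning: try several degrees and keep the one with the lowest error.
candidate_degrees = [1, 2, 3, 5]
scores = {d: validation_mse(d) for d in candidate_degrees}
best_degree = min(scores, key=scores.get)

for d, mse in scores.items():
    print(f"degree={d}: validation MSE={mse:.3f}")
print("best degree:", best_degree)
```

The straight-line fit (degree 1) cannot capture the curvature in the data, so its held-out error is clearly higher than that of the degree-2 fit. Libraries such as Scikit-Learn provide ready-made utilities for the same split-and-search procedure.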

## 6. Communicating Final Results
It is important to communicate final results to the business or a non-technical audience. Hence, visualization forms an important part of Data Science. We shall learn how to include great visualizations in your Data Science pipeline to communicate your results effectively.

## 7. Scaling and Big Data
Models that perform well on small datasets might not do so on large datasets because of the variance present in the data. Hence, working with big data and scaling up the algorithms is a challenge. Models are initially validated with small datasets before working with big data.
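
One way to obtain a small dataset for initial validation is to draw a uniform random sample from the large one without loading it all into memory. The sketch below uses reservoir sampling for this. The technique is not named in the lesson and is offered here only as one possible approach. The function name and the integer stream are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of
    unknown length, reading it once and keeping only k items in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical "big" dataset: a stream of 100,000 record ids.
small_dataset = reservoir_sample(range(100_000), k=1000)
print(len(small_dataset))  # 1000
```

A model can be prototyped and validated on `small_dataset` first, then re-checked on the full data, since behavior on the sample does not guarantee the same behavior at scale.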

## 2. Exercise

#### Instructions
NBA Win Probabilities

The data contains every NBA team’s chance of winning in every minute across every game:

https://fivethirtyeight.com/features/every-nba-teams-chance-of-winning-in-every-minute-across-every-game/

* Given these probabilities, find out which team is most likely to win across all games.
* Assign the team name to the variable winning_team and print it out.
* Use the Hint feature to look up the command to print out the answer.

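```python
import pandas as pd

# Placeholder to be replaced with the name of the winning team
winning_team = 'None'

# The file is tab-separated, hence sep='\t'
nba_data = pd.read_csv("https://raw.githubusercontent.com/colaberry/538data/master/nba-winprobs/nba.tsv", sep='\t')

# Row(s) holding the highest value in column '48'
winning_team_row = nba_data[nba_data['48'] == nba_data['48'].max()]
```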

## 2. Solution