# Interactive Introduction to Data Science Methodology Notebook
![Python logo](https://www.python.org/static/community_logos/python-logo-master-v3-TM.png "Python logo")

Data science is an umbrella term that covers several subset areas, such as data analytics, machine learning, deep learning/neural networks, and AI.

![datasciencemethodologyumbrella.png](img/datascienceumbrella.png)


**The data analytics subset has four main parts, each serving a different purpose:**

- Descriptive Question: What has happened? (Think visual representation of events)
- Diagnostic Question: Why did this happen? (Think root cause)
- Prescriptive Question: How can we fix it? (Think of the health industry recommending a medicine or course of treatment based on symptoms)
- Predictive Question: What will they want next? (Think recommender systems like Amazon, Netflix)

[tools&libsusedfordatascience](tools&libsusedfordatascience.txt)


## What is Data Science & Its Methodology?
![datasciencemethodologyflowchart.png](img/datasciencemethodologyflowchart.png)



The first step in data science methodology is to ask a question that data science can help answer. Don't get caught up in the tools or the data first. The methodology begins with spending time asking clarifying questions to reach what can be referred to as a business understanding. Only then do you determine which data will be used to answer the core question.

List of Data Science Questions & Associated Algorithms: 
[QuestionsDataScienceCanAnswer_AlgorithmsAssociated](https://github.com/compscicoach/introtodatasciencemethodology/blob/master/QuestionsDataScienceCanAnswer_AlgorithmsAssociated.xlsx)


# A Day in the Life of a Data Scientist Script


## __Step 1: Business Understanding (The Business Problem: What question am I trying to answer by applying data science methodology?)__

I am Dan, the data scientist, and my boss, Mr. Washington, has asked me to solve the following business problem: Which gender has purchased more vehicles this year? Is there a specific make/model of vehicle preferred by each gender that a campaign could target? Which specific features are preferred most?

To clarify, I ask Mr. Washington these questions: Do you hope to increase the profitability of a certain model, or overall profitability? He tells me he wishes to potentially replace an underperforming model and provide recommendations to customers based on their preferences. He also wants to know what the overall picture of his business looks like.

## __Step 2: Analytic Approach (Choosing the Right Machine Learning Algorithm to Solve the Problem)__

Sample questions, with a regression (statistical modeling) or classification approach to obtaining the answer:

- Regression (Relationship) Sample Question: Is there a relationship between time spent in the sun and plant height? Y = f(x), where:
  - Y is plant height
  - f is any model that represents the relationship
  - x is the time the plant spent in the sun
- Classification Sample Question: Which animal is in a given image? Y = f(x), where:
  - Y is one of {dog, cat, horse, other}
  - f is any model that captures the relationship
  - x is the images encoded into tabular form
- Regression (Prediction) Sample Question: What factors best predict electricity demand? Y = f(x), where:
  - Y is the quantity of electricity demanded
  - f is any model capturing the relationship between your data and the electricity demanded
  - x probably includes features such as price, temperature, season, region, and others

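To make the Y = f(x) notation concrete, here is a minimal sketch of the plant-height regression question. The measurements are invented purely for illustration, and f is assumed to be a straight line fitted with NumPy's `polyfit`; a real analysis might need a different model.

```python
import numpy as np

# Hypothetical measurements, invented for illustration only.
hours_in_sun = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # x
plant_height_cm = np.array([5.0, 7.0, 9.0, 11.0, 13.0])  # Y

# Assume f is a straight line: Y = slope * x + intercept.
slope, intercept = np.polyfit(hours_in_sun, plant_height_cm, deg=1)


def f(x):
    """Fitted model: predicted plant height for x hours in the sun."""
    return slope * x + intercept


print(f"Y = {slope:.2f} * x + {intercept:.2f}")
print("Predicted height after 6 hours:", round(f(6.0), 2))
```

The same pattern applies to the other sample questions: only the choice of Y, x, and the family of models used for f changes.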
## __Step 3: Data Requirements (Requirements of the Business)__

### Sample Requirements: The goal is to predict an outcome (x, y, or z), or to understand the combination of events that leads to it, in an easy-to-understand format so the predictions can be applied
   
- What does the picture of my business look like?    
- Which car models are selling the least?
- What factors can impact what recommendations I give to my customers?

## __Step 4: Data Science Tools Chosen Based on Steps 1, 2, and 3 Above__

- Factors to consider:
  - Type of data I am analyzing: unstructured, structured, or semi-structured
  - Language best suited for the data: R, SQL, Python
  - Method used, which depends on the type of question you seek to answer, e.g. data aggregation (preprocessing/preparation of raw data) vs. data wrangling (preparation during analysis/model building)
  - Visualizations you'd like

## __Step 5: Locate the Data You Will Load into Python for Analysis (Data Collection)__

Note the following:

- For the purpose of this exercise we've already generated fake data (Mockaroo) stored in this Jupyter notebook.
- __We will now illustrate how to locate & read a JSON data file (semi-structured) that you want to use for analysis, using the Python code below.__
- __For the purpose of this notebook illustration I use data from only one source, but normally data aggregation is a step in the process: it takes multiple data sources and consolidates them into one location for analysis.__
- __Tip: To execute the Python code in the grey code cell below, click on it and press Shift + Enter, then clear the cell's output (if you wish).__

In [4]:
```python
# Test opening the JSON data file and reading it as raw text
with open("/users/brandyguillory/Jupyter Files/pricedata.json", "r") as json_file:
    file_contents = json_file.read()
print(file_contents)
```

[{"currency":"Yuan Renminbi","currency code":"CNY","gender":"Male","race":"Samoan","job title":"Marketing Assistant","car model year":1984,"car make":"Mazda","car model":"RX-7","price":"4,323 Yuan Renminbi"},
{"currency":"Euro","currency code":"EUR","gender":"Male","race":"Creek","job title":"Chief Design Engineer","car model year":1997,"car make":"Buick","car model":"Park Avenue","price":"95,173 Euro"},
{"currency":"Shilling","currency code":"UGX","gender":"Female","race":"Costa Rican","job title":"Senior Cost Accountant","car model year":1995,"car make":"Mitsubishi","car model":"Pajero","price":"458,483 Shilling"},
{"currency":"Dollar","currency code":"CAD","gender":"Female","race":"Seminole","job title":"Assistant Manager","car model year":2000,"car make":"Chevrolet","car model":"Tracker","price":"08,262 Dollar"},
{"currency":"Euro","currency code":"EUR","gender":"Female","race":"Laotian","job title":"Senior Financial Analyst","car model year":1998,"car make":"Ford","car model":"Tau
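
The text printed above is still one long string. As a minimal sketch, the standard-library `json` module can turn such text into Python objects. The two records below are copied from the output above; the variable names are only illustrative.

```python
import json

# Two records copied from the printed output above, kept as raw JSON text.
raw_text = """[
{"currency":"Yuan Renminbi","currency code":"CNY","gender":"Male","race":"Samoan","job title":"Marketing Assistant","car model year":1984,"car make":"Mazda","car model":"RX-7","price":"4,323 Yuan Renminbi"},
{"currency":"Euro","currency code":"EUR","gender":"Male","race":"Creek","job title":"Chief Design Engineer","car model year":1997,"car make":"Buick","car model":"Park Avenue","price":"95,173 Euro"}
]"""

# json.loads parses the text into a list of dictionaries, one per record.
records = json.loads(raw_text)

print(len(records))
print(records[0]["car make"], records[0]["car model year"])
print(sorted(records[0].keys()))
```

Each record becomes a dictionary keyed by field name, which is the semi-structured shape that pandas reads in the next step.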

## Step 6: Importing the data to Python for analysis (Data Understanding - Iterative)

There are two core libraries used for data analysis in Python:

- the pandas library, essential to data analysis and used for advanced data manipulation, which stores data in a dataframe
- the NumPy library for scientific computing and mathematical operations, which stores data in N-dimensional arrays (1D, 2D, 3D)

We will demonstrate both, first importing our data into Python for data analysis using these libraries via the code below:

In [5]:
```python
# Loading libraries into memory for us to work with our data
import pandas as pd
import numpy as np
```

In [6]:
```python
# Read our JSON file into a pandas dataframe
jsonfile = '/users/brandyguillory/Jupyter Files/pricedata.json'
df = pd.read_json(jsonfile, orient='columns')
# There were 1000 rows in our JSON file of data; the command below displays the dataframe contents
df.head(1000)
```

| | car make | car model | car model year | currency | currency code | gender | job title | price | race |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda | RX-7 | 1984 | Yuan Renminbi | CNY | Male | Marketing Assistant | 4,323 Yuan Renminbi | Samoan |
| 1 | Buick | Park Avenue | 1997 | Euro | EUR | Male | Chief Design Engineer | 95,173 Euro | Creek |
| 2 | Mitsubishi | Pajero | 1995 | Shilling | UGX | Female | Senior Cost Accountant | 458,483 Shilling | Costa Rican |
| 3 | Chevrolet | Tracker | 2000 | Dollar | CAD | Female | Assistant Manager | 08,262 Dollar | Seminole |
| 4 | Ford | Taurus | 1998 | Euro | EUR | Female | Senior Financial Analyst | 6,854 Euro | Laotian |
| 5 | Dodge | Stratus | 1996 | Naira | NGN | Female | Registered Nurse | 20,459 Naira | Hmong |
| 6 | Acura | RL | 2001 | Real | BRL | Male | Product Engineer | 6,635 Real | Lumbee |
| 7 | Volkswagen | Passat | 2012 | Real | BRL | Female | Programmer Analyst III | 59,099 Real | Melanesian |
| 8 | Lexus | ES | 2006 | Euro | EUR | Male | Financial Advisor | 4,144 Euro | Samoan |
| 9 | Chrysler | Town & Country | 2006 | Ruble | RUB | Male | Accountant III | 440,768 Ruble | Mexican |


## __Step 7: Understand Variables/Data Spreads & Convert to Dataset (Data Understanding & Data Preprocessing - Iterative)__

**Data Understanding:**

Look at Summary statistics and visualizations
   - Percentiles can help identify the range for most of the data
   - Averages and medians can describe central tendency
   - Correlations can indicate strong relationships
    
Visualize the data
   - Box plots can identify outliers
   - Density plots and histograms show the spread of data
   - Scatter plots can describe bivariate relationships
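
As a rough sketch of the summary statistics listed above, the snippet below uses only the model years from the ten rows displayed in Step 6. The 1.5 × IQR rule is the usual convention behind box-plot whiskers and is an assumption here, not something the notebook prescribes.

```python
import pandas as pd

# Model years copied from the ten rows displayed in Step 6.
model_years = pd.Series(
    [1984, 1997, 1995, 2000, 1998, 1996, 2001, 2012, 2006, 2006],
    name="car model year",
)

# Percentiles show the range that holds most of the data.
print(model_years.quantile([0.25, 0.5, 0.75]))

# Mean and median describe central tendency.
mean_year = model_years.mean()
median_year = model_years.median()
print("mean:", mean_year, "median:", median_year)

# A common box-plot rule flags values beyond 1.5 * IQR from the quartiles.
q1, q3 = model_years.quantile(0.25), model_years.quantile(0.75)
iqr = q3 - q1
is_outlier = (model_years < q1 - 1.5 * iqr) | (model_years > q3 + 1.5 * iqr)
outliers = model_years[is_outlier]
print("outliers:", outliers.tolist())
```

On the full dataframe, `df.describe()` gives a similar overview for every numeric column in one call.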
   
**Preprocessing takes the raw data we loaded above (JSON) and converts it to a clean data set in four steps:**

- Data Cleaning: handling and correcting the following
  - missing data
  - noisy data (meaningless/corrupted)
  - detection and removal of outliers
  - minimizing duplication and computed biases within the data

- Data Integration (the data aggregation part of the process): vast quantities of data from disparate sources are combined to form consistent data. After data cleaning, this consistent data is used for analysis.

- Data Transformation: convert the raw data into a specified format according to the needs of the model:
  - Normalization: numerical data is rescaled into a specified range, e.g. between 0 and 1, so that features are on a comparable scale.
  - Aggregation: as the word suggests, this method combines features into one. For example, two categories can be combined to form a new group.
  - Generalization: lower-level attributes are converted to higher-level ones (for example, a street address generalized to a city).

- Data Reduction: redundancy within the data is removed and the data is organized efficiently.
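
Two of these steps can be sketched on a few rows copied from the Step 6 output: cleaning the `price` text into a number and min-max normalizing `car model year`. The new column names (`price amount`, `model year scaled`) are made up for this illustration, and the currency is deliberately ignored, so the amounts are not comparable across rows.

```python
import pandas as pd

# Four rows copied from the dataframe displayed in Step 6.
sample = pd.DataFrame({
    "car model year": [1984, 1997, 1995, 2000],
    "price": ["4,323 Yuan Renminbi", "95,173 Euro",
              "458,483 Shilling", "08,262 Dollar"],
})

# Data cleaning: remove exact duplicate rows, if any.
sample = sample.drop_duplicates()

# Data cleaning: keep the leading digits-and-commas part of each price string,
# drop the commas, and convert to an integer amount (currency is ignored here).
amount_text = sample["price"].str.extract(r"^([\d,]+)", expand=False)
sample["price amount"] = amount_text.str.replace(",", "", regex=False).astype(int)

# Normalization (min-max): rescale model year into the range 0 to 1.
years = sample["car model year"]
sample["model year scaled"] = (years - years.min()) / (years.max() - years.min())

print(sample)
```

Min-max scaling is only one way to normalize; which range and method to use depends on the model chosen later.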


In Step 6 we stored our JSON data file in a pandas dataframe. We will now convert that pandas dataframe into a NumPy array using the executable code below:

In [7]:
```python
# Convert the dataframe into a NumPy array (one row per record)
numarray = df.values
print(numarray)
```

[['Mazda' 'RX-7' 1984 ... 'Marketing Assistant' '4,323 Yuan Renminbi'
  'Samoan']
 ['Buick' 'Park Avenue' 1997 ... 'Chief Design Engineer' '95,173 Euro'
  'Creek']
 ['Mitsubishi' 'Pajero' 1995 ... 'Senior Cost Accountant'
  '458,483 Shilling' 'Costa Rican']
 ...
 ['Buick' 'Skylark' 1990 ... 'Analyst Programmer' '8,018 Euro'
  'Paraguayan']
 ['Mercury' 'Cougar' 1969 ... 'Geologist III' '76,353 Euro' 'Pueblo']
 ['Kia' 'Optima' 2010 ... 'Assistant Media Planner' '2,185 Yuan Renminbi'
  'Filipino']]
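
Because the dataframe mixes text and numbers, the resulting array holds generic Python objects rather than a numeric type. The sketch below shows how to inspect such an array and pull a numeric column out for math. It uses three columns from the first three rows shown in Step 6, and `small_df` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Three columns of the first three rows shown in Step 6.
small_df = pd.DataFrame({
    "car make": ["Mazda", "Buick", "Mitsubishi"],
    "car model": ["RX-7", "Park Avenue", "Pajero"],
    "car model year": [1984, 1997, 1995],
})

mixed = small_df.values
print(mixed.shape)  # (rows, columns)
print(mixed.dtype)  # object, because text and numbers share one array

# Select the numeric column by position and give it a numeric dtype.
years = mixed[:, 2].astype(np.int64)
print(years.dtype, years.mean())
```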


## __Step 8: Modeling the Data__

Goal?

Data wrangling is performed during the iterative analysis and model building (enrich, cleanse, structure, publish a documented process).
- I am using Python 3 for data wrangling (pandas, NumPy, SciPy).
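
As a small wrangling sketch tied back to the business question from Step 1, the snippet below counts purchases by gender and cross-tabulates car make against gender. It uses only the ten rows displayed in Step 6, which is far too few to draw conclusions. With the full data, the same calls would be applied to `df`, assuming it keeps the `gender` and `car make` column names.

```python
import pandas as pd

# Gender and make columns copied from the ten rows displayed in Step 6.
sample = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female",
               "Female", "Male", "Female", "Male", "Male"],
    "car make": ["Mazda", "Buick", "Mitsubishi", "Chevrolet", "Ford",
                 "Dodge", "Acura", "Volkswagen", "Lexus", "Chrysler"],
})

# How many purchases were made per gender?
purchases_by_gender = sample["gender"].value_counts()
print(purchases_by_gender)

# Which makes were bought by which gender? Rows are makes, columns are genders.
make_by_gender = pd.crosstab(sample["car make"], sample["gender"])
print(make_by_gender)
```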

## __Step 9: Validating the Model (Evaluation)__

Goal - accuracy?

- Training models (scikit-learn)
- Visualizations (Bokeh, seaborn, Matplotlib)

## __Step 10: Deploying the Model__

## __Step 11: Updating the Model & Keeping It Relevant (Feedback)__

- Data
- Improvement on model (add more enrichments, etc.)