# Interactive Introduction to Data Science Methodology Notebook
![Python logo](https://www.python.org/static/community_logos/python-logo-master-v3-TM.png "Python logo")


The first step in data science methodology is to ask a question that data science can help answer. Don't get caught up in the tools or the data before you have that question.

## __Step 1: Understanding different types of questions to ask in data science methodology__ 

### - Remember Questions: Who, what, where, or when did something happen?

   - Sample Question: What browser is a particular user using to browse this site?
   
   <br>
    
   - Requirements: data collection and manipulation in SQL, R, or Python (used in this notebook)
   <br>  
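
A Remember question usually reduces to a lookup. The sketch below assumes a small, invented visit log with hypothetical `user_id` and `browser` columns; neither the column names nor the values come from this notebook's data file.

```python
import pandas as pd

# Hypothetical visit log; column names and values are invented for illustration.
visits = pd.DataFrame(
    {
        "user_id": [101, 102, 103, 101],
        "browser": ["Firefox", "Chrome", "Safari", "Firefox"],
    }
)

# Remember question: which browser is user 102 using?
# Filter the rows for that user, then take the first matching browser value.
browser_of_user = visits.loc[visits["user_id"] == 102, "browser"].iloc[0]
print(browser_of_user)
```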
   
### - Understand Questions: Can you summarize what happened?

   - Sample Question: What browser do my users tend to use?
   
   <br>
    
   - Requirements: data aggregation and summarization
   <br>   
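
An Understand question moves from a single record to a summary. As a minimal sketch, assuming an invented `browser` column (not taken from this notebook's data file), pandas can count and rank the values:

```python
import pandas as pd

# Hypothetical visit log; the values are invented for illustration.
visits = pd.DataFrame(
    {"browser": ["Chrome", "Firefox", "Chrome", "Safari", "Chrome", "Firefox"]}
)

# Count visits per browser, most common first.
browser_counts = visits["browser"].value_counts()

# The same summary expressed as a share of all visits.
browser_share = visits["browser"].value_counts(normalize=True)

# The browser users tend to use is the one with the highest count.
most_common = browser_counts.idxmax()
print(browser_counts)
print(most_common)
```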
   
### - Apply Questions: What happens when …? 
    
- Sample Question: Is there a relationship between the time a plant spends in the sun and its height?

      Y = f(x)
   - Y is the plant height
   - f is any model that represents the relationship
   - x is the time the plant spent in the sun
   <br>
     
- Sample Question: Which animal is in a given image?

      Y = f(x)
   - Y is one of {dog, cat, horse, other}
   - f is any model that captures the relationship
   - x is the image encoded in tabular form
   <br>
     
- Requirements: regression analysis (statistical modeling), classification (supervised machine learning), hypothesis testing
     <br>    
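
To make `Y = f(x)` concrete for the plant example, here is one possible sketch that takes `f` to be a straight line fitted with NumPy. The measurements are invented, and a linear model is only an assumption; any other model could play the role of `f`.

```python
import numpy as np

# Invented measurements for illustration only.
hours_in_sun = np.array([2.0, 4.0, 6.0, 8.0, 10.0])          # x
plant_height_cm = np.array([11.2, 16.8, 23.1, 28.9, 35.0])   # Y

# Fit f as a straight line: Y is approximately slope * x + intercept.
slope, intercept = np.polyfit(hours_in_sun, plant_height_cm, deg=1)


def predict_height(hours):
    """Apply the fitted line to a new amount of time in the sun."""
    return slope * hours + intercept


print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(predict_height(6.0))
```

A positive slope suggests that height increases with time in the sun in this toy data; a hypothesis test would be needed before drawing that conclusion from real measurements.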
     
### - Analyze Questions: What are the key parts and relationships of …?
 
- Sample Question: What factors best predict electricity demand?
  
       Y = f(x)
       
   - Y is the quantity of electricity demanded
   - f is any model that captures the relationship between your data and the electricity demanded
   - x likely includes features such as price, temperature, season, and region
       
     <br>
- Requirements: regression analysis with feature selection (removing factors that do not help predict electricity demand), classification, clustering
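
One rough way to start on an Analyze question is to rank candidate features by how strongly they correlate with the target. The sketch below uses invented numbers and hypothetical column names (`temperature_c`, `price_per_kwh`, `demand_mwh`); correlation is only a first screen here, not a full feature selection method.

```python
import pandas as pd

# Invented observations; column names and values are hypothetical.
demand = pd.DataFrame(
    {
        "temperature_c": [30, 25, 20, 35, 15, 28],
        "price_per_kwh": [0.12, 0.13, 0.12, 0.11, 0.12, 0.13],
        "demand_mwh": [520, 450, 390, 600, 330, 495],
    }
)

# Correlation of every feature with the target, excluding the target itself.
correlations = demand.corr()["demand_mwh"].drop("demand_mwh")

# Rank features by the strength (absolute value) of the correlation.
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```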
   
 
### - Evaluate Questions: Is this the best approach?

   
   - Sample Question: Can we save money by pricing different products better?
    
   <br>
   - Requirements: regression analysis, classification, scenario analysis, prediction
  <br>
  
### - Create Questions: Can you predict what will happen to … under new conditions? 

   - Sample Question: Where should I place an ad on a webpage so that a viewer is most likely to click it?
    
  <br>      
   - Requirements: optimization, experimentation
<br>   
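
For the ad placement question, an experiment could show the ad in several positions and compare click-through rates. The following is a minimal sketch with invented counts and hypothetical placement names; a real experiment would also need a significance test before picking a winner.

```python
# Invented experiment results; placement names and counts are hypothetical.
experiment = {
    "top_banner": {"views": 1000, "clicks": 42},
    "sidebar": {"views": 1000, "clicks": 27},
    "footer": {"views": 1000, "clicks": 9},
}

# Click-through rate = clicks / views for each placement.
click_through_rate = {
    name: result["clicks"] / result["views"] for name, result in experiment.items()
}

# Pick the placement with the highest observed click-through rate.
best_placement = max(click_through_rate, key=click_through_rate.get)
print(best_placement, click_through_rate[best_placement])
```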

## __Step 2: Choose a set of data science tools__

   - Factors to consider:
      
      <br> 
      
      - Type of data you will analyze: Unstructured vs. Structured
      
       <br> 
       
      - Language best suited to the data: R, SQL, or Python
      
      <br> 
      
      - The type of question you want to answer dictates the method used, e.g. data aggregation (preprocessing and preparing raw data) vs. data wrangling (preparing data during analysis and model building)
      
      <br>  
      
      - Visualizations you'd like to produce

## __Step 3: Locate the data you will load into Python for analysis__

  Note the following: 
- For the purpose of this exercise, we have already generated fake data (with Mockaroo), which is stored with this Jupyter notebook.
- I am using Python 3 with the following libraries:
   - Data wrangling (pandas, numpy, scipy)
   - Training models (scikit-learn)
   - Visualizations (bokeh, seaborn, matplotlib)
- __We will now illustrate how to locate and read a JSON data file that you want to use for analysis, using the Python code below.__
- __Tip: To execute the Python code in the grey code cell below, click on it and press Shift + Enter. You can then clear the cell's output if you wish.__

   


In [None]:
# Test opening and reading the JSON data file (assumes the file is UTF-8 encoded)
json_path = "/users/brandyguillory/Jupyter Files/fakedata_mockaroo.json"
with open(json_path, "r", encoding="utf-8") as json_file:
    file_contents = json_file.read()
print(file_contents)
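
Reading the file as text only confirms that it can be opened. To check that it also parses as JSON, the standard `json` module can be used. The sketch below writes a small stand-in file to a temporary directory so that it runs anywhere; the records and the `sample.json` name are invented and do not reflect the fields in the Mockaroo file.

```python
import json
import tempfile
from pathlib import Path

# Stand-in records; the real file's fields are not shown in this notebook.
sample_records = [
    {"id": 1, "first_name": "Ada"},
    {"id": 2, "first_name": "Grace"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small JSON file to read back (hypothetical file name).
    json_path = Path(tmp_dir) / "sample.json"
    json_path.write_text(json.dumps(sample_records), encoding="utf-8")

    # Confirm the file exists, then parse it into Python lists and dicts.
    if json_path.exists():
        with json_path.open("r", encoding="utf-8") as json_file:
            records = json.load(json_file)

print(len(records), records[0])
```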



## __Step 4: Importing the data into Python for analysis__

Two core libraries are used for data analysis in Python:
- the pandas library, used for advanced data manipulation, which stores data in DataFrames
- the NumPy library, used for scientific computing and mathematical operations, which stores data in N-dimensional arrays (1D, 2D, 3D, and so on)

We will demonstrate both. First, we load these libraries with the code below:


In [None]:
# Load the libraries we need to work with our data
import pandas as pd
import numpy as np
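
To illustrate the difference between the two containers, here is a small sketch with invented values: a DataFrame has labeled columns that can each hold a different type, while a NumPy array is an unlabeled grid described by its shape and number of dimensions.

```python
import numpy as np
import pandas as pd

# A DataFrame: labeled columns (values are invented for illustration).
frame = pd.DataFrame({"height_cm": [11.2, 16.8], "hours_in_sun": [2.0, 4.0]})

# A 2D NumPy array: rows and columns addressed by position only.
grid = np.array([[1, 2, 3], [4, 5, 6]])

# Columns are selected by name in pandas.
column_mean = frame["height_cm"].mean()

print(frame.dtypes)
print(grid.shape, grid.ndim)
```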


We will now load our JSON data file into a pandas DataFrame using the executable code below:

In [None]:
# Load the JSON file into a pandas DataFrame
jsonfile = '/users/brandyguillory/Jupyter Files/fakedata_mockaroo.json'
df = pd.read_json(jsonfile, orient='columns')
# Our JSON file contains 1000 rows; the command below displays the DataFrame contents
df.head(1000)
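
After loading, it is usually worth checking the result before analyzing it. The sketch below reads an invented JSON string through `StringIO` so that it runs without the data file; the `id` and `first_name` fields are hypothetical and are not taken from the Mockaroo file.

```python
from io import StringIO

import pandas as pd

# Invented JSON text standing in for the contents of a data file.
sample_json = '[{"id": 1, "first_name": "Ada"}, {"id": 2, "first_name": "Grace"}]'

# read_json accepts a file-like object as well as a path.
sample_df = pd.read_json(StringIO(sample_json))

# Quick checks: number of rows and columns, column types, first rows.
print(sample_df.shape)
print(sample_df.dtypes)
print(sample_df.head())
```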

We will now convert our pandas DataFrame into a NumPy array using the executable code below:

In [None]:
# Convert the DataFrame to a NumPy array (pandas recommends to_numpy() over .values)
numarray = df.to_numpy()
print(numarray)
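
One thing to keep in mind when converting: a NumPy array holds a single data type, so a DataFrame that mixes text and numbers becomes an array of generic Python objects. A possible workaround, sketched here with invented columns, is to select the numeric columns first.

```python
import pandas as pd

# Invented mixed-type data; column names are hypothetical.
mixed = pd.DataFrame(
    {"id": [1, 2], "first_name": ["Ada", "Grace"], "score": [0.5, 0.75]}
)

# Converting everything forces one common dtype for text and numbers.
all_values = mixed.to_numpy()

# Selecting only numeric columns keeps a numeric dtype suitable for math.
numeric_values = mixed.select_dtypes(include="number").to_numpy()

print(all_values.dtype)
print(numeric_values.dtype, numeric_values.shape)
```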