# Data Analysis Demo

What is data analysis? Simply put, it is the process of analysing data to draw conclusions. Generally this involves two phases: preparing the data, and analysing the data.

## Preparing the Data

Before you begin to analyze data, you need to make sure that the data is ready to be analyzed. This involves things like:
1. Data Collection - gather the raw data (typically at Scorpion the data collection is done automatically by various services)
2. Data Quality Control - make sure the data was collected, reported, and stored in a way that doesn't introduce any inherent biases or errors
3. Data Cleaning - often there will be missing data, or data you know is less reliable, which needs to be filled in, corrected, or removed
4. Data Prep - often the data needs to be transformed or aggregated before analysis can begin
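
As a rough illustration of steps 2 to 4, here is a minimal pandas sketch. The `raw` table, its column names, and the choice to fill missing spend with the median are all made-up assumptions for illustration; real cleaning rules depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a raw extract (not the demo dataset).
raw = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "leads": [10, np.nan, 7, 7, 12],
    "spend": [100.0, 80.0, np.nan, 50.0, 90.0],
})

# Quality control: count the missing values in each column.
missing_per_column = raw.isna().sum()

# Cleaning: drop rows with no lead count, fill missing spend with the median.
clean = raw.dropna(subset=["leads"]).copy()
clean["spend"] = clean["spend"].fillna(clean["spend"].median())

# Prep: aggregate to one row per region.
prepared = clean.groupby("region", as_index=False)[["leads", "spend"]].sum()

print(missing_per_column)
print(prepared)
```

Dropping rows and filling with the median are only two of many options; which one is appropriate depends on why the data is missing.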

## Analysing the Data

This is the process of taking reliable data and trying to get business insights out of it. Here are the 4 main types of analysis:

1. Descriptive Analysis: Descriptive data analysis looks at past data and tells what happened. This is often used when tracking Key Performance Indicators (KPIs), revenue, sales leads, and more.
2. Diagnostic Analysis: Diagnostic data analysis aims to determine why something happened. Once your descriptive analysis shows that something negative or positive happened, diagnostic analysis can be done to figure out the reason. A business may see that leads increased in the month of October and use diagnostic analysis to determine which marketing efforts contributed the most.
3. Predictive Analysis: Predictive data analysis predicts what is likely to happen in the future. In this type of research, trends are derived from past data which are then used to form predictions about the future. For example, to predict next year’s revenue, data from previous years will be analyzed. If revenue has gone up 20% every year for many years, we would predict that revenue next year will be 20% higher than this year. This is a simple example, but predictive analysis can be applied to much more complicated issues such as risk assessment, sales forecasting, or qualifying leads.
4. Prescriptive Analysis: Prescriptive data analysis combines the information found from the previous 3 types of data analysis and forms a plan of action for the organization to face the issue or decision. This is where the data-driven choices are made.
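
The revenue example from the predictive analysis item can be sketched in a few lines of plain Python. The yearly figures below are invented so that each year is 20% higher than the last; the "forecast" is just the average historical growth applied to the latest year, which is about the simplest predictive model possible.

```python
# Hypothetical yearly revenue figures, each 20% higher than the year before.
revenue_by_year = {2019: 100.0, 2020: 120.0, 2021: 144.0, 2022: 172.8}

years = sorted(revenue_by_year)

# Year-over-year growth rate for each consecutive pair of years.
growth_rates = [
    revenue_by_year[curr] / revenue_by_year[prev] - 1
    for prev, curr in zip(years, years[1:])
]
average_growth = sum(growth_rates) / len(growth_rates)

# Naive forecast: apply the average historical growth to the latest year.
latest_year = years[-1]
forecast = revenue_by_year[latest_year] * (1 + average_growth)

print(f"average growth: {average_growth:.1%}")
print(f"forecast for {latest_year + 1}: {forecast:.2f}")
```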

# Demo
For this demo, I've selected a famous and well-studied dataset, the Ames Housing Dataset. Here's a link to a Kaggle page where you can download the data for yourself:
https://www.kaggle.com/c/home-data-for-ml-course/overview

I'll go over the basics of importing your data, summarising your data, basic cleaning, looking for insights and trends, and visualization. Of course, there's much more, but this should give a basic idea of what a day in the life of a data analyst might look like.

I downloaded the data from that website, and unzipped it all into my "data/" folder. Now I'm ready to begin!

First, I need to import the various libraries (or modules) that I'll be using. In Python, it's easy to import any library that you've already installed with pip or conda. In this case, I've installed these libraries (and many more) into a Conda Virtual Environment.

In each case, I'm importing "as", which just means I'm renaming them for my own purposes. That way, I can just write "pd" instead of writing "pandas" all the time. These are standard conventions for these libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Before diving into the data itself, I'd like to read the "data_description.txt" file that came with the dataset. Of course, I could just open it in a text editor, but why not practice using Python to read it? The file is very long, so I'll only print out the first 1000 characters for now, but I could print out the whole thing.

In [None]:
file_location = 'data/data_description.txt'
with open(file_location, 'r') as file:
    data_description = file.read()
    
first_thousand_chars = data_description[:1000]
print(first_thousand_chars)
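
If the file were much larger, one alternative would be to read only the first 1000 characters instead of loading everything and slicing. Here's a small sketch of that idea; it writes a throwaway placeholder file (hypothetical content, not the real description file) so that it runs without the dataset.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in file so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "description.txt"
    path.write_text("placeholder line of text\n" * 100, encoding="utf-8")

    # read(n) stops after n characters instead of loading the whole file.
    with open(path, "r", encoding="utf-8") as file:
        preview = file.read(1000)

print(len(preview))
```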

Wow, that's a lot of info about the columns. I'll probably just focus on a few of these for now.

Ok, let's import our data. pandas is a wonderful library for reading in tabular data (data in columns and rows).

The basic unit of pandas is a dataframe, which is like a fancy excel spreadsheet. Let's create one.

Note: in Jupyter notebooks, I can just write "df" and it will be displayed, because Jupyter automatically shows the value of the last expression in a cell. In a regular Python script, you'd need to type "print(df)".

In [None]:
df = pd.read_csv('data/train.csv')
df

pandas is smart, and only shows us a small sample of the data. We wouldn't be able to make sense of it if we saw all of it at once. By default, pandas limits to the first and last 5 rows, as well as the first and last 10 columns.
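
If you want to see more (or less) than the default view, pandas exposes display options. The sketch below uses a made-up table of zeros named `wide` as a stand-in for the housing data, and the limits of 10 rows and 6 columns are arbitrary values picked for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table standing in for the housing data: 100 rows, 30 columns.
wide = pd.DataFrame(np.zeros((100, 30)), columns=[f"col_{i}" for i in range(30)])

max_rows_before = pd.get_option("display.max_rows")

# Temporarily tighten the display limits; they revert when the block exits.
with pd.option_context("display.max_rows", 10, "display.max_columns", 6):
    truncated_text = repr(wide)

max_rows_after = pd.get_option("display.max_rows")

print(truncated_text)
print(wide.shape)  # (rows, columns) of the full table
```

Using `pd.option_context` rather than `pd.set_option` keeps the change local, so later cells still get the default view.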

Let's see what all the columns are before we dive in any deeper.

In [None]:
df.columns

pandas has lots of useful summary tools built into it. Let's use a very helpful one, `.describe()`.

This method returns a number of useful statistics for each numerical column.

In [None]:
df.describe()

Already we can see that some columns have missing data! The count row reports the number of non-empty rows, so any column whose count is lower than the total number of rows in the dataframe (`len(df)`) is missing some values.
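
A more direct way to find the gaps is to count missing values per column. This is a sketch on a tiny made-up table called `sample`; I'm assuming column names like those in the housing data, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for df; values are made up.
sample = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = sample.isna().sum()

# Keep only the columns that actually have gaps, worst first.
columns_with_gaps = missing_counts[missing_counts > 0].sort_values(ascending=False)

# The describe() count and the missing count add up to the number of rows.
non_missing = sample.describe().loc["count"]

print(columns_with_gaps)
```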

Some other things to cover:
1. Missing Data
2. Data Selection
3. Groupby
4. Graphing
5. Asking Questions of your Data
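
Before applying these to the real data, here is a small sketch of data selection and groupby on a made-up table. The `homes` rows and the price cutoff of 160000 are invented for illustration; only the column names are borrowed from the housing data.

```python
import pandas as pd

# Hypothetical toy rows; values are made up.
homes = pd.DataFrame({
    "YrSold": [2008, 2008, 2009, 2009, 2009],
    "OverallCond": [5, 7, 6, 4, 8],
    "SalePrice": [150000, 210000, 180000, 120000, 260000],
})

# Data selection: filter rows with a boolean mask and keep two columns.
expensive = homes.loc[homes["SalePrice"] > 160000, ["YrSold", "SalePrice"]]

# Groupby: one row per sale year, with the mean condition and number of sales.
per_year = homes.groupby("YrSold", as_index=False).agg(
    mean_cond=("OverallCond", "mean"),
    n_sales=("OverallCond", "size"),
)

print(expensive)
print(per_year)
```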

In [None]:
graph_input = df[['OverallCond', 'YrSold']].groupby('YrSold').mean().reset_index()
graph_input

In [None]:
# Plot the mean overall condition for each sale year
plt.scatter(x=graph_input.YrSold, y=graph_input.OverallCond)
plt.xlabel('YrSold')
plt.ylabel('Mean OverallCond');

# Asking Questions of your Data

## What's the 75th percentile of house prices?

In [None]:
price_75 = df.SalePrice.describe()['75%']
price_75
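
Pulling the value out of `.describe()` works, but pandas also has `Series.quantile`, which computes a single percentile directly. A quick sketch on a made-up price series (the five values are invented) suggests the two agree:

```python
import pandas as pd

# Hypothetical prices, not taken from the housing data.
prices = pd.Series([100000, 150000, 200000, 250000, 300000], name="SalePrice")

via_describe = prices.describe()["75%"]
via_quantile = prices.quantile(0.75)

print(via_describe, via_quantile)
```

`quantile` also accepts any fraction between 0 and 1, so it isn't limited to the 25%, 50%, and 75% labels that `.describe()` reports by default.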

## What is the distribution of house quality at or above the 75th percentile of price?

In [None]:
answer_2 = df[df.SalePrice >= price_75].OverallQual.value_counts().sort_index().reset_index()
answer_2.columns = ['OverallQual', 'Count']
answer_2

## Make a bar chart for the previous info

In [None]:
sns.barplot(x=answer_2.OverallQual, y=answer_2.Count);

## How does that compare to the distribution for the bottom 25% by price?

In [None]:
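# If you're doing something more than once, write a function
def distribution_of_qual_within_percentile(lower, upper):
    """Count OverallQual values for houses priced between two describe() labels, e.g. 'min' and '25%'."""
    # Compute the price summary once and look up both bounds in it
    price_summary = df.SalePrice.describe()
    lower_price = price_summary[lower]
    upper_price = price_summary[upper]
    in_range = df[(df.SalePrice >= lower_price) & (df.SalePrice <= upper_price)]
    output = in_range.OverallQual.value_counts().sort_index().reset_index()
    output.columns = ['OverallQual', 'Count']
    return output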


answer_3 = distribution_of_qual_within_percentile('min', '25%')
sns.barplot(x=answer_3.OverallQual, y=answer_3.Count);
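
Two separate bar charts can be hard to compare by eye, so one option is to put both distributions in a single table as proportions. This is only a sketch on a made-up `homes` table (eight invented rows), not the real housing data.

```python
import pandas as pd

# Hypothetical toy data; prices and quality scores are made up.
homes = pd.DataFrame({
    "SalePrice": [90, 100, 110, 150, 160, 200, 250, 300],
    "OverallQual": [4, 5, 4, 5, 6, 7, 8, 8],
})

low_cut = homes["SalePrice"].quantile(0.25)
high_cut = homes["SalePrice"].quantile(0.75)

bottom = homes.loc[homes["SalePrice"] <= low_cut, "OverallQual"]
top = homes.loc[homes["SalePrice"] >= high_cut, "OverallQual"]

# normalize=True gives each quality score's share within its own group.
comparison = pd.DataFrame({
    "bottom_25": bottom.value_counts(normalize=True),
    "top_25": top.value_counts(normalize=True),
}).fillna(0).sort_index()

print(comparison)
```

Using proportions rather than raw counts keeps the comparison fair when the two groups contain different numbers of houses.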