## Introduction to Data Analysis

### What is data analysis?
- the process of inspecting, cleansing, transforming, and modeling data
- the goal is to discover useful information that supports decision making

### What data analysis tools are there?
- Managed tools
    - Qlik
    - Tableau
    - Looker
    - Zoho
- Programming languages
    - Python
    - R
    - Julia

### Why Python?
- simple, with good documentation and reference material
- large ecosystem of libraries
- free

### When to choose R?
- extreme performance is required
- RStudio is required

### What is the data analysis process?
- extraction
- cleansing
- wrangling - reshaping
- analysis
- action
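
The steps above can be sketched end to end on a tiny dataset. This is only an illustration: the `orders` data, its column names, and the choice of "drop rows with missing units" as the cleansing step are assumptions made for the example, not part of the notes.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for an extracted CSV file.
raw_csv = io.StringIO(
    "region,units,unit_price\n"
    "north,10,2.5\n"
    "south,,3.0\n"
    "north,4,2.5\n"
    "south,6,3.0\n"
)

# 1. extraction: load the raw data
orders = pd.read_csv(raw_csv)

# 2. cleansing: drop rows where the unit count is missing
orders = orders.dropna(subset=["units"])

# 3. wrangling: derive a revenue column from existing columns
orders["revenue"] = orders["units"] * orders["unit_price"]

# 4. analysis: total revenue per region
revenue_by_region = orders.groupby("region")["revenue"].sum()

# 5. action: report the region with the highest revenue
top_region = revenue_by_region.idxmax()
print(revenue_by_region)
print("Top region:", top_region)
```

In a real project each step is usually much larger, but the order of the steps tends to stay the same.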

### Difference between data scientists and data analysts
- data scientists are more technical
- data analysts create reports

## Data Analysis Example A

### Read CSV into pandas
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 'sales_data.csv' is a placeholder path; point it at your own CSV file
sales = pd.read_csv('sales_data.csv')
```

### The Data Frame
```python
# view the structure of data, similar to a database schema
sales.info()
```

### Describe the data
```python
# view data statistics
sales.describe()
```

### Plot Data
```python
sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(14,6))
```

### Group data
```python
# count occurrences of each value
sales['Age_Group'].value_counts()
```
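
As a small self-contained illustration of counting values, the sketch below uses a hypothetical `age_groups` Series instead of the real `sales` data. It also shows `normalize=True`, which returns proportions rather than raw counts.

```python
import pandas as pd

# Hypothetical sample; the real `sales` data is not reproduced here.
age_groups = pd.Series(
    ["Adults (35-64)", "Youth (<25)", "Adults (35-64)", "Adults (35-64)"],
    name="Age_Group",
)

# raw counts, sorted from most to least frequent
counts = age_groups.value_counts()

# the same counts expressed as a share of all rows
shares = age_groups.value_counts(normalize=True)

print(counts)
print(shares)
```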

### Correlate Data
```python
# quick correlation analysis on the numeric columns
# (numeric_only needs pandas 1.5 or newer)
corr = sales.corr(numeric_only=True)

# graph the correlation
fig = plt.figure(figsize=(8,8))
plt.matshow(corr, cmap='RdBu', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);

# graph a correlation by manually selecting the x and y columns (scatter plot)
sales.plot(kind='scatter', x='Customer_Age', y='Revenue', figsize=(6,6))
```
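
To get a feel for what the correlation matrix contains, here is a minimal sketch on a made-up frame. The `toy` data and the `Discount` column are assumptions for illustration; the values are chosen so that the correlations are exactly +1 and -1.

```python
import pandas as pd

# Hypothetical data: Revenue rises with quantity, Discount falls with it.
toy = pd.DataFrame({
    "Order_Quantity": [1, 2, 3, 4],
    "Revenue": [10, 20, 30, 40],   # exactly 10 * Order_Quantity
    "Discount": [4, 3, 2, 1],      # decreases as quantity increases
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr)

# a single pair can be read straight out of the matrix
qty_vs_revenue = corr.loc["Order_Quantity", "Revenue"]
qty_vs_discount = corr.loc["Order_Quantity", "Discount"]
```

Values near +1 mean the two columns move together, values near -1 mean they move in opposite directions, and values near 0 mean there is little linear relationship.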

## Data Analysis Example B

### Data wrangling

```python
# quickly create new columns
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']

# plot the resulting column
sales['Revenue_per_Age'].plot(kind='density', figsize=(14,6))
```

### Run operations on calculated columns
```python
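# recompute the cost from quantity and unit cost
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']

# count rows where the recomputed cost differs from the recorded cost
(sales['Calculated_Cost'] != sales['Cost']).sum()

# raise every unit price by 3%
sales['Unit_Price'] *= 1.03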
```
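
The consistency check above can be tried on a small made-up frame. The `toy` values below are assumptions, with one row deliberately inconsistent so the check has something to find.

```python
import pandas as pd

# Hypothetical rows; the last recorded Cost is deliberately wrong.
toy = pd.DataFrame({
    "Order_Quantity": [2, 3, 1],
    "Unit_Cost": [5, 4, 10],
    "Cost": [10, 12, 11],
})

# recompute the cost from its parts
toy["Calculated_Cost"] = toy["Order_Quantity"] * toy["Unit_Cost"]

# boolean mask: True where recorded and recomputed cost disagree
mismatch = toy["Calculated_Cost"] != toy["Cost"]

# summing a boolean mask counts the True values
n_mismatches = int(mismatch.sum())

# the same mask selects the offending rows for inspection
bad_rows = toy.loc[mismatch]
print(n_mismatches)
print(bad_rows)
```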

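### Filter data with the loc indexer
```python
# all rows where the state is Kentucky
sales.loc[sales['State'] == 'Kentucky']
# mean revenue for one age group
sales.loc[sales['Age_Group'] == 'Adults (35-64)', 'Revenue'].mean()
# number of rows that belong to either of two age groups
sales.loc[(sales['Age_Group'] == 'Youth (<25)') | (sales['Age_Group'] == 'Adults (35-64)')].shape[0]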
```
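
When a filter checks one column against several values, `isin` is a shorter alternative to chaining `|` conditions. The sketch below compares the two on a hypothetical `toy` frame; the rows and the `Seniors (64+)` label are made up for the example.

```python
import pandas as pd

# Hypothetical rows standing in for the real `sales` data.
toy = pd.DataFrame({
    "State": ["Kentucky", "Oregon", "Kentucky", "Oregon"],
    "Age_Group": ["Youth (<25)", "Adults (35-64)", "Seniors (64+)", "Adults (35-64)"],
    "Revenue": [100, 200, 300, 400],
})

# two conditions combined with | (each side needs its own parentheses)
either = (toy["Age_Group"] == "Youth (<25)") | (toy["Age_Group"] == "Adults (35-64)")

# the same mask built with isin
with_isin = toy["Age_Group"].isin(["Youth (<25)", "Adults (35-64)"])

# count the matching rows
n_rows = toy.loc[with_isin].shape[0]

# conditions can also be combined with & (logical AND)
kentucky_high = toy.loc[(toy["State"] == "Kentucky") & (toy["Revenue"] > 150)]

print(n_rows)
print(kentucky_high)
```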