# Task 1: Common Steps of a Data Analysis Pipeline

Here are some common steps of an analysis pipeline (the order is not fixed, and not every step is always necessary):

1. Load Data

    - Check file types and encodings.

    - Check delimiters (space, comma, tab).

    - Skip rows and columns as needed.

2. Clean Data

    - Remove columns not being used.

    - Deal with “incorrect” data.

    - Deal with missing data.

3. Process Data

    - Create any new columns needed that are combinations or aggregates of other columns (examples include weighted averages, categorizations, and groups).

    - Find and replace operations (examples include replacing the string ‘Strongly Agree’ with the number 5).

    - Other substitutions as needed.

    - Deal with outliers.

4. Wrangle Data

    - Restructure data format (columns and rows).

    - Merge other data sources into your dataset.

5. Exploratory Data Analysis (not required for this Task).

6. Data Analysis (not required for this Task).

7. Export reports/data analyses and visualizations (not required for this Task).

For this Task, I will only ask you to set up a partial pipeline for the data loading, cleaning, processing, and wrangling steps.

## 1. Load Data

- Check file types and encodings.

- Check delimiters (space, comma, tab).

- Skip rows and columns as needed.

In [2]:
import pandas as pd

df = pd.read_csv("insurance.csv")  # Load the CSV file into a DataFrame
df.head()  # Display the first 5 rows

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
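The loading checks listed above (encoding, delimiter, skipped rows and columns) can be made explicit in the `read_csv` call. The following is a minimal sketch, not the method required for this Task. It uses the five rows shown in the output above as an in-memory stand-in for `insurance.csv`, and it assumes that the file is UTF-8 encoded and that its header starts with an empty field, which is what typically makes pandas name the first column `Unnamed: 0`. The name `df_loaded` is chosen only for this illustration.

```python
import io

import pandas as pd

# The five rows displayed above, used as a stand-in for insurance.csv.
# Assumption: the real header begins with an empty field (the saved index).
raw = """,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
"""

df_loaded = pd.read_csv(
    io.BytesIO(raw.encode("utf-8")),  # for the real file, pass "insurance.csv"
    encoding="utf-8",  # assumed encoding; change it if the file differs
    sep=",",           # delimiter; use "\t" for tab-separated files
    skiprows=0,        # number of leading rows to skip (none here)
    index_col=0,       # treat the first column as the index, not as data
)

print(df_loaded.dtypes)  # check that each column was parsed with a sensible type
print(df_loaded.shape)
```

Reading the first column as the index avoids carrying an extra `Unnamed: 0` column through the rest of the pipeline.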


## 2. Clean Data

- Remove columns not being used.

- Deal with “incorrect” data.

- Deal with missing data.

In [3]:
# Drop the children and region columns; drop() returns a new DataFrame, so df itself is left unchanged
df2 = df.drop(columns=["children", "region"])
df2.head()
# df["smoker"].map(dict(yes=1, no=0))
# df2.replace(["yes", "no"], [1, 0])  # Replaces yes and no with 1 and 0

Unnamed: 0,age,sex,bmi,smoker,charges
0,19,female,27.9,yes,16884.924
1,18,male,33.77,no,1725.5523
2,28,male,33.0,no,4449.462
3,33,male,22.705,no,21984.47061
4,32,male,28.88,no,3866.8552
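The cell above covers only the first cleaning step (removing unused columns). The sketch below illustrates the other two, missing and “incorrect” data, on the five rows shown above. The missing `bmi` and the negative `age` are hypothetical problems injected purely for illustration; nothing here claims that the real file contains them. The accepted age range of 0 to 120 is likewise an assumption.

```python
import numpy as np
import pandas as pd

# The five rows displayed above (after dropping children and region).
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Hypothetical problems, injected for illustration only.
dirty = sample.copy()
dirty.loc[1, "bmi"] = np.nan  # a missing value
dirty.loc[2, "age"] = -5      # an "incorrect" value

# Count missing values per column before deciding how to handle them.
missing_per_column = dirty.isna().sum()

cleaned = (
    dirty
    .dropna(subset=["bmi"])                    # drop rows with a missing bmi
    .loc[lambda d: d["age"].between(0, 120)]   # keep only plausible ages (assumed range)
    .reset_index(drop=True)
)

print(missing_per_column)
print(cleaned)
```

Dropping rows is only one option; filling missing values (for example with `fillna`) may suit a dataset better, depending on how much data is affected.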


## 3. Process Data

- Create any new columns needed that are combinations or aggregates of other columns (examples include weighted averages, categorizations, and groups).

- Find and replace operations (examples include replacing the string ‘Strongly Agree’ with the number 5).

- Other substitutions as needed.

- Deal with outliers.
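As a rough sketch of these processing steps on the five rows shown earlier: a categorization column built from `bmi`, a find-and-replace of `yes`/`no` with 1/0, and a simple outlier flag on `charges`. The BMI cut points (18.5, 25, 30), the column names `smoker_flag` and `bmi_group`, and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not requirements of this Task.

```python
import pandas as pd

# The five rows displayed earlier (after dropping children and region).
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

processed = sample.assign(
    # Find and replace: map the strings "yes"/"no" to the numbers 1/0.
    smoker_flag=lambda d: d["smoker"].map({"yes": 1, "no": 0}),
    # Categorization: bin bmi into groups (cut points are an assumption).
    bmi_group=lambda d: pd.cut(
        d["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,  # intervals are closed on the left, e.g. [25, 30)
    ),
)

# Outliers: flag charges outside 1.5 * IQR of the quartiles (one common rule).
q1, q3 = processed["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ~processed["charges"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(processed)
print(is_outlier)
```

Flagging outliers in a separate boolean Series keeps the decision about removing, capping, or keeping them apart from the detection itself.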

## 4. Wrangle Data

- Restructure data format (columns and rows).

- Merge other data sources into your dataset.
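Both wrangling steps can be sketched on the five rows shown earlier. The lookup table `region_lookup` and its `region_code` column are hypothetical, invented here only to have something to merge; the Task does not specify a second data source. The reshaping step uses `melt` to move from a wide layout to a long one.

```python
import pandas as pd

# A subset of the columns from the five rows displayed earlier.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Hypothetical second data source: one row per region.
region_lookup = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest"],
    "region_code": ["SW", "SE", "NW"],
})

# Merge: a left join keeps every row of sample; validate checks that
# each region appears at most once in the lookup table.
merged = sample.merge(region_lookup, on="region", how="left", validate="many_to_one")

# Restructure: wide to long, one row per (person, measure) pair.
long_format = merged.melt(
    id_vars=["age", "region_code"],
    value_vars=["bmi", "charges"],
    var_name="measure",
    value_name="value",
)

print(merged)
print(long_format)
```

The `validate` argument is optional, but it raises an error early if the lookup table has duplicate keys, which would otherwise silently duplicate rows.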

# Task 2: Method Chaining and Writing Python Programs

In [4]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine


insurance = pd.read_csv("insurance.csv")  # insurance data file (not used by the wine example below)
data = load_wine()  # scikit-learn wine dataset, returned as a Bunch (dict-like) object
# Method chaining begins

# df = (   
#     pd.DataFrame(data.data,columns=data.feature_names)
#     .rename(columns={"color_intensity": "ci"})
#     .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
#     .loc[lambda x: x['alcohol']>14]
#     .sort_values("alcohol", ascending=False)
#     .reset_index(drop=True)
#     .loc[:, ["alcohol", "ci", "hue"]]
# )

# df
data

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
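The commented-out chain above is written for the wine dataset. As a hedged sketch of the same pattern on the insurance data, the block below applies the same sequence of methods (`rename`, `assign`, `.loc` filter, `sort_values`, `reset_index`, column selection) to the five insurance rows shown in Task 1. The new column names `cost` and `smoker_flag` and the `bmi > 25` filter are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# The five insurance rows displayed in Task 1.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

chained = (
    sample
    .rename(columns={"charges": "cost"})                                  # shorter column name
    .assign(smoker_flag=lambda x: np.where(x["smoker"] == "yes", 1, 0))   # yes/no -> 1/0
    .loc[lambda x: x["bmi"] > 25]                                         # keep rows with bmi above 25
    .sort_values("cost", ascending=False)                                 # most expensive first
    .reset_index(drop=True)                                               # renumber rows after filtering
    .loc[:, ["age", "bmi", "smoker_flag", "cost"]]                        # keep only these columns
)

print(chained)
```

Each step returns a new DataFrame, so the original `sample` is not modified, and individual lines can be commented out to inspect intermediate results.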

# Task 3: Conduct an Exploratory Data Analysis (EDA) on your dataset

# Task 4: Conduct your analysis to help answer your research question(s)

Dawson: What is the average increase in cost that smokers pay compared with their non-smoking counterparts?

Sean: What is the relationship between having children and increased medical costs? How does this relationship change as the number of children changes?
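
One possible starting point for both questions is a grouped mean of `charges`. The sketch below runs on the five rows shown in Task 1 only, so its numbers say nothing about the full dataset; it illustrates the mechanics, not a result. Note that the second question needs the `children` column, so that column has to be kept during cleaning. The variable names are chosen for this example.

```python
import pandas as pd

# The five rows displayed in Task 1, far too few to draw conclusions from.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Question 1: average charges for smokers vs. non-smokers, and the gap between them.
mean_by_smoker = sample.groupby("smoker")["charges"].mean()
smoker_gap = mean_by_smoker["yes"] - mean_by_smoker["no"]

# Question 2: average charges (and group size) for each number of children.
mean_by_children = sample.groupby("children")["charges"].agg(["mean", "count"])

print(mean_by_smoker)
print(smoker_gap)
print(mean_by_children)
```

Reporting the group size next to each mean makes it easier to judge how much weight a given average can bear.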