---
# Step 5: Understand Your Data
Now we will experiment with functions that summarize and describe the features of your data. These are not all of the functions that you could use, but they are the ones that I use the most! The goal here is to mess around with all of these functions and learn through trial and error.

---

## 1. Import packages

In [30]:
import pandas as pd
import numpy as np
from os import path as fp

## 2. Import the cleaned Data

### Import your cleaned data into this notebook:
- In the cell below, import your cleaned data. If you need a refresher on using relative filepaths to import data, refer back to #2 in the [step 4 notebook](https://github.com/alexdsbreslav/python_for_uxr/blob/master/step4_clean_your_data/step4_workbook.ipynb).

### Import the example data into this notebook:
I'll be using this data to show examples, so it may be helpful to have on hand!

In [62]:
example_df = pd.read_excel(fp.join(fp.dirname(fp.abspath('')), 'data', 'raw_data', 'user_data.xlsx'), engine='openpyxl')

## 3. Understanding Your Variables
First, we want to understand what each one of our variables looks like. This will help us understand the statistical properties of the data that we are most interested in and will inform our later analyses! You never want to jump straight to looking at the relationship between variables until you understand each individual variable first!

For example, imagine that you wanted to understand whether individuals in the US or Canadian markets used the search function more often in your app. If you immediately (and only) compared the mean number of searches between markets, you may find that they have the same mean and falsely conclude that there are no differences across markets. However, each market could have the same mean while having completely different variability. If you knew up front that each market has different variability, you wouldn't have even compared means because you would have known it would be misleading!
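
To make this concrete, here is a minimal sketch using made-up numbers (these values are hypothetical and do not come from the example dataset). Both markets have the same mean number of searches, but the spread is completely different:

```python
import pandas as pd

# Hypothetical searches per user in each market (invented for illustration)
us_searches = pd.Series([5, 5, 5, 5])
canada_searches = pd.Series([0, 0, 10, 10])

# The means are identical...
us_mean = us_searches.mean()
canada_mean = canada_searches.mean()

# ...but the standard deviations are not
us_std = us_searches.std()
canada_std = canada_searches.std()

print(us_mean, canada_mean)
print(us_std, canada_std)
```

Every US user in this toy data searched exactly 5 times, while Canadian users either never searched or searched 10 times. Comparing only the means would hide that difference.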

So let's start off with functions that will help you understand single variables in your data!

### Understanding Categorical and Ordinal Variables
For columns with categorical variables (e.g. `employment_status`, `gender_identity`, `market`) or ordinal variables (e.g. `how satisfied are you with our product?`), we may ask simple questions like:

1. Are all of the options that we expect to see actually showing up in our data?
2. How frequent is each option? 
3. Which option shows up the most?

To check these basics, we can use the `unique()`, `value_counts()`, and `describe()` functions. For more information on what input these functions take, what they output, and what errors they may raise, check the documentation:
1. [unique() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html)
2. [value_counts() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)
3. [describe() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)

- In the cell below, try using each function above to summarize a column with a categorical or ordinal variable in it.

If you are having trouble selecting all of the data in one column, remember that you can use dot notation, brackets, or the `loc[]` indexer:
```python
df.name_of_your_column.unique()               # dot notation
df['name_of_your_column'].value_counts()      # brackets
df.loc[:, 'name_of_your_column'].describe()   # loc[]: all rows, one column
```
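
As a quick illustration of what each function returns, here is a small sketch on a toy dataframe. The `toy_df` name and its values are made up for this example:

```python
import pandas as pd

# Hypothetical toy data (not the example dataset)
toy_df = pd.DataFrame({'market': ['US', 'Canada', 'US', 'US', 'Canada']})

# 1. Which options show up at all?
options = toy_df['market'].unique()

# 2. How frequent is each option? (most frequent first)
counts = toy_df['market'].value_counts()

# 3. Which option shows up the most? (see 'top' and 'freq')
summary = toy_df['market'].describe()

print(options)
print(counts)
print(summary)
```

For a text column like this one, `describe()` reports the number of non-missing values (`count`), the number of distinct options (`unique`), the most common option (`top`), and how often it appears (`freq`).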

If you would like, you can use the `describe()` function on more than one column at a time!
- In the cell below, try selecting two or more columns of your dataframe and using the `describe()` function to summarize them at the same time.

If you are having trouble selecting multiple columns, remember that you can use a list of column names inside your brackets:
```python
df[['name_of_your_column1', 'name_of_your_column2']].describe()
# or select every column whose name contains a given prefix
df[[i for i in df.columns if 'your_column_prefix' in i]].describe()
```
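
Here is a short sketch of the prefix approach on a toy dataframe. The column names and values are hypothetical, and I use `startswith()` rather than `in` so that only names that begin with the prefix are matched:

```python
import pandas as pd

# Hypothetical survey data with two columns that share a prefix
toy_df = pd.DataFrame({
    'satisfaction_q1': [5, 4, 3],
    'satisfaction_q2': [2, 4, 4],
    'market': ['US', 'US', 'Canada'],
})

# Build the list of column names first, then select them all at once
prefix_columns = [col for col in toy_df.columns if col.startswith('satisfaction_')]
prefix_summary = toy_df[prefix_columns].describe()

print(prefix_columns)
print(prefix_summary)
```

Building the list on its own line makes it easy to print and check which columns were picked up before you summarize them.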

### Understanding Continuous Variables
For columns with continuous variables (e.g. `sessions_in_app_30_days`, `click_throughs`), the two key things to understand are the most common outcomes (central tendency) and the spread of the outcomes (variability):

1. The central tendency of a continuous variable is typically described using the mean or median.
2. The variability of a continuous variable is typically described using the standard deviation or the values at certain quantiles.

Different metrics are better in different circumstances, but here, we'll just look at all of them using the `describe()` function again!

- In the cell below, use the `describe()` function to summarize a continuous variable.

Note that you can get most of the statistics listed in the `describe()` output by replacing `.describe()` with the name of the statistic.
```python
df.name_of_your_column.mean()
df['name_of_your_column'].min()
```

If you want the value at any given quantile, you can use the `quantile()` function. 
```python
df.loc[:, 'name_of_your_column'].quantile(0.25)
```

For more information on what input these functions take, what they output, and what errors they may raise, check the documentation.
1. [quantile() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)

- In the cell below, calculate the range of a variable by subtracting the min from the max.
- Next, calculate the interquartile range of a variable by subtracting Q1 (the 0.25 quantile, or 25th percentile) from Q3 (the 0.75 quantile, or 75th percentile).
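
If you get stuck, here is one possible approach sketched on a made-up Series (the `sessions` values are hypothetical). Try it on your own column before peeking!

```python
import pandas as pd

# Hypothetical session counts for nine users
sessions = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8])

# Range: max minus min
value_range = sessions.max() - sessions.min()

# Interquartile range: Q3 minus Q1
q1 = sessions.quantile(0.25)
q3 = sessions.quantile(0.75)
iqr = q3 - q1

print(value_range)
print(iqr)
```

The range is sensitive to a single extreme value, while the interquartile range only describes the middle half of the data, so the two can tell quite different stories.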

## 4. Understanding the Relationship Between Variables
When we are simply exploring and understanding our data, we can use some simple techniques to assess the relationship between two variables. The basic idea is that we are going to group respondents based on one variable and compare the values of another variable across the groups.

We'll use two important functions to do this: `groupby()`, and `cut()`. For more information on what input these functions take, what they output, and what errors they may raise, check the documentation:
1. [groupby() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
2. [cut() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html)

### Grouping by a Categorical Variable
We can use the `groupby()` function to group respondents by a categorical variable.

The `groupby()` function needs several parts. First, you add a list of column names inside the groupby brackets. Here, my list is just one column.
```python
df.groupby(['name of the grouping column'])
```

Second, you specify which variable you're interested in assessing.
```python
df.groupby(['name of the grouping column'])['name_of_column_of_interest']
```

Last, you specify how you want to aggregate all of the values in the `column_of_interest`. Some examples include `count()`, `value_counts()`, `mean()`, `sum()`, `std()`, `min()`, and `max()`.
```python
df.groupby(['name of the grouping column'])['name_of_column_of_interest'].mean()
```
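
Putting the three parts together, here is a minimal sketch on a toy dataframe (the data is made up for illustration). It also shows `agg()`, which lets you ask for several aggregations at once:

```python
import pandas as pd

# Hypothetical toy data (not the example dataset)
toy_df = pd.DataFrame({
    'device_type': ['android', 'android', 'apple', 'apple'],
    'searches': [2, 4, 1, 7],
})

# Part 1: group by a column; Part 2: pick the column of interest
grouped = toy_df.groupby(['device_type'])['searches']

# Part 3: aggregate the values within each group
mean_by_device = grouped.mean()

# Several aggregations at once: one column per statistic
stats_by_device = grouped.agg(['mean', 'max'])

print(mean_by_device)
print(stats_by_device)
```

The result is indexed by the grouping column, so you can look up a single group with something like `mean_by_device['apple']`.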

In the example dataset, let's look at whether respondents in the US market are more or less likely to have an Android phone. Each variable is categorical:

In [44]:
example_df[['market', 'device_type']].head()

| | market | device_type |
|---|---|---|
| 0 | Canada | android |
| 1 | US | android |
| 2 | US | android |
| 3 | Canada | apple |
| 4 | US | android |


So we can use the `value_counts()` aggregating function to count the number of times `android` and `apple` show up in each group.

In [45]:
example_df.groupby(['market'])['device_type'].value_counts()

market  device_type
Canada  android        278
        apple           68
US      android        517
        apple          137
Name: device_type, dtype: int64

Counts are not particularly useful here because the US market is so much bigger than Canada. We can divide the value counts by the total number of users in each market to get proportions, which will be more useful!

In [47]:
example_df.groupby(['market'])['device_type'].value_counts()/example_df.groupby(['market'])['device_type'].count()

market  device_type
Canada  android        0.803468
        apple          0.196532
US      android        0.790520
        apple          0.209480
Name: device_type, dtype: float64
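
As an alternative to dividing by the group totals yourself, `value_counts()` accepts a `normalize=True` argument that returns proportions directly. Here is a sketch on a toy dataframe (the data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 4 US users and 2 Canadian users
toy_df = pd.DataFrame({
    'market': ['US', 'US', 'US', 'US', 'Canada', 'Canada'],
    'device_type': ['android', 'android', 'android', 'apple', 'android', 'apple'],
})

# Proportion of each device type within each market
proportions = toy_df.groupby(['market'])['device_type'].value_counts(normalize=True)

print(proportions)
```

Within each market the proportions sum to 1, which makes markets of very different sizes directly comparable.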

- In the cell below, investigate the relationship between two new categorical variables in the `example_df` or in your data.

You can use the `describe()` function when your column of interest is continuous.

In [48]:
example_df.groupby(['device_type'])['searches_in_app_30_days'].describe()

| device_type | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| android | 795.0 | 3.972327 | 7.169537 | 0.0 | 0.0 | 1.0 | 4.0 | 38.0 |
| apple | 205.0 | 3.458537 | 6.623228 | 0.0 | 0.0 | 0.0 | 4.0 | 38.0 |


- In the cell below, investigate the relationship between a new categorical variable and continuous variable in the `example_df` or in your data.

### Grouping by a Continuous Variable
There may be instances where you want to group respondents by a continuous variable. This is where the `cut()` function comes in handy. If we try to use the `groupby()` function on a continuous variable with lots of unique values, the output will not be very informative...

In [51]:
example_df.groupby(['sessions_in_app_30_days'])['searches_in_app_30_days'].describe()

| sessions_in_app_30_days | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 251.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 134.0 | 1.074627 | 0.872529 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 |
| 2 | 134.0 | 1.970149 | 1.681122 | 0.0 | 0.0 | 2.0 | 4.0 | 4.0 |
| 3 | 151.0 | 2.980132 | 2.485881 | 0.0 | 0.0 | 3.0 | 6.0 | 6.0 |
| 4 | 136.0 | 4.058824 | 3.337253 | 0.0 | 0.0 | 4.0 | 8.0 | 8.0 |
| 5 | 18.0 | 6.111111 | 4.04226 | 0.0 | 5.0 | 5.0 | 10.0 | 10.0 |
| 6 | 11.0 | 5.454545 | 4.987256 | 0.0 | 0.0 | 6.0 | 9.0 | 12.0 |
| 7 | 13.0 | 7.538462 | 6.678515 | 0.0 | 0.0 | 7.0 | 14.0 | 14.0 |
| 8 | 12.0 | 8.0 | 6.822423 | 0.0 | 0.0 | 8.0 | 16.0 | 16.0 |
| 9 | 11.0 | 9.0 | 8.049845 | 0.0 | 0.0 | 9.0 | 18.0 | 18.0 |


We can use the `cut()` function to bin participants into groups. For example, we may want to understand differences in search behavior of individuals that opened the app 5 or fewer times, 6-10 times, or 11+ times. We'll use `cut()` to add a new column to our data and then use it as a grouping variable.

Remember that when we create data and add it to our dataframe, we use the syntax...
```python
df['your_new_column_name'] = your_new_data
```

In [65]:
example_df['sessions_binned'] = pd.cut(example_df['sessions_in_app_30_days'], #column values to bin
                                       bins=[0,6,11,100], #bin min and max values
                                       right=False, #include the value on the left of the bin, exclude the value of the right
                                       labels=['0-5', '6-10', '11+']) #label the bins with text (rather than Interval data type)
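
To see exactly how `right=False` treats values that sit on a bin edge, here is a small sketch on a made-up Series (the values are hypothetical). It also shows one optional variation: using `np.inf` as the last edge so that the top bin has no upper limit. That is my own suggestion, not something the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical session counts, including values that sit on bin edges
sessions = pd.Series([0, 5, 6, 10, 11, 250])
bin_labels = ['0-5', '6-10', '11+']

# right=False makes each bin include its left edge and exclude its right edge:
# [0, 6), [6, 11), [11, inf)
binned = pd.cut(sessions, bins=[0, 6, 11, np.inf], right=False, labels=bin_labels)

# With a fixed upper edge of 100, any value of 100 or more falls outside every bin
capped = pd.cut(sessions, bins=[0, 6, 11, 100], right=False, labels=bin_labels)

print(binned.tolist())
print(capped.isna().sum())
```

Values outside all of the bins become missing (`NaN`) rather than raising an error, so it is worth checking for missing values in the new column after binning.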

Let's double check the data to make sure the bins line up with the `sessions_in_app_30_days` column.

In [67]:
example_df.head()

| | email | market | device_type | monthly_active_user | account_type | sessions_in_app_30_days | searches_in_app_30_days | sessions_binned |
|---|---|---|---|---|---|---|---|---|
| 0 | example_email0@outlook.com | Canada | android | 1 | paid | 1 | 1 | 0-5 |
| 1 | example_email1@icloud.com | US | android | 1 | free | 17 | 0 | 11+ |
| 2 | example_email2@outlook.com | US | android | 1 | paid | 4 | 4 | 0-5 |
| 3 | example_email3@icloud.com | Canada | apple | 1 | paid | 2 | 0 | 0-5 |
| 4 | example_email4@gmail.com | US | android | 0 | paid | 0 | 0 | 0-5 |


Now that our continuous variable is binned, we can use it as a grouping variable!

In [68]:
example_df.groupby(['sessions_binned'])['searches_in_app_30_days'].describe()

| sessions_binned | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0-5 | 824.0 | 1.84466 | 2.540108 | 0.0 | 0.0 | 0.0 | 3.0 | 10.0 |
| 6-10 | 59.0 | 8.355932 | 6.885173 | 0.0 | 0.0 | 9.0 | 14.0 | 20.0 |
| 11+ | 117.0 | 15.846154 | 13.2604 | 0.0 | 0.0 | 16.0 | 26.0 | 38.0 |


- In the cell below, bin a new continuous variable in the `example_df` or in your data. Then use that binned variable as the grouping variable to examine the relationship between two variables.

---

# Nice job!
You can now quickly explore and understand the basic features of your data!

You are now ready to start visualizing your data. Visualizing your data is vital to develop a deeper understanding of your data and to share that understanding with stakeholders! To get started on the next step, [click here](https://github.com/alexdsbreslav/python_for_uxr/tree/master/step6_visualize_your_data) to see the instructions online. You can also open up the instructions on your computer by navigating to the `step6_visualize_your_data` folder and opening the `offline_README.pdf`.

---