# Statistical Methods in Pandas - Exercise

## Introduction

In this lab, you'll get hands-on experience with some of the key summary statistics methods in Pandas.

## Objectives
You will be able to:

- Use the `df.describe()` and `df.info()` summary statistics methods  
- Use built-in Pandas methods for calculating summary statistics 
- Apply a function to every element in a DataFrame 


## Getting Started

For this lab, we'll be working with a dataset containing information on various Lego sets. You will find this dataset in the file `'lego_sets.csv'`.

In the cell below:

- Import Pandas and set the standard alias of `pd`
- Load in the `'lego_sets.csv'` dataset using the `read_csv()` function
- Display the first few rows of the DataFrame to get a feel for what we'll be working with

![lego](https://media.giphy.com/media/103TZqgLqRJq0M/giphy.gif)

In [1]:
import pandas as pd
import numpy as np


In [6]:
lego = pd.read_csv("../data/lego_sets.csv")
lego.head(3)

| | ages | list_price | num_reviews | piece_count | play_star_rating | prod_desc | prod_id | prod_long_desc | review_difficulty | set_name | star_rating | theme_name | val_star_rating | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6-12 | 29.99 | 2.0 | 277.0 | 4.0 | Catapult into action and take back the eggs fr... | 75823.0 | Use the staircase catapult to launch Red into ... | Average | Bird Island Egg Heist | 4.5 | Angry Birds™ | 4.0 | US |
| 1 | 6-12 | 19.99 | 2.0 | 168.0 | 4.0 | Launch a flying attack and rescue the eggs fro... | 75822.0 | Pilot Pig has taken off from Bird Island with ... | Easy | Piggy Plane Attack | 5.0 | Angry Birds™ | 4.0 | US |
| 2 | 6-12 | 12.99 | 11.0 | 74.0 | 4.3 | Chase the piggy with lightning-fast Chuck and ... | 75821.0 | Pitch speedy bird Chuck against the Piggy Car.... | Easy | Piggy Car Escape | 4.3 | Angry Birds™ | 4.1 | US |


## Getting DataFrame-Level Statistics

We'll begin by getting some overall summary statistics on the dataset. There are two ways we'll get this information -- `.info()` and `.describe()`.

### Using `.info()`

The `.info()` method provides us metadata on the DataFrame itself. This allows us to answer questions such as:

* What data type does each column contain?
* How many rows are in my dataset? 
* How many total non-missing values does each column contain?
* How much memory does the DataFrame take up?

In the cell below, call our DataFrame's `.info()` method. 

In [8]:
lego.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ages               12261 non-null  object 
 1   list_price         12261 non-null  float64
 2   num_reviews        10641 non-null  float64
 3   piece_count        12261 non-null  float64
 4   play_star_rating   10486 non-null  float64
 5   prod_desc          11884 non-null  object 
 6   prod_id            12261 non-null  float64
 7   prod_long_desc     12261 non-null  object 
 8   review_difficulty  10206 non-null  object 
 9   set_name           12261 non-null  object 
 10  star_rating        10641 non-null  float64
 11  theme_name         12258 non-null  object 
 12  val_star_rating    10466 non-null  float64
 13  country            12261 non-null  object 
dtypes: float64(7), object(7)
memory usage: 1.3+ MB
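
The `+` in `1.3+ MB` means the figure is a lower bound: by default, `.info()` does not measure the strings stored in `object` columns. The sketch below uses a small hypothetical DataFrame (not the real `lego_sets.csv`) to show one way to request the full measurement with `memory_usage="deep"` and to capture the report as text with the `buf` parameter.

```python
import io

import pandas as pd

# Hypothetical toy rows standing in for a few columns of lego_sets.csv
toy = pd.DataFrame({
    "set_name": ["Bird Island Egg Heist", "Piggy Plane Attack", "Piggy Car Escape"],
    "piece_count": [277.0, 168.0, 74.0],
    "review_difficulty": ["Average", "Easy", None],
})

# .info() prints to stdout by default; buf= redirects the report into a buffer
buffer = io.StringIO()
toy.info(buf=buffer, memory_usage="deep")
report = buffer.getvalue()
print(report)

# Per-column byte counts, with and without measuring the stored strings
deep_bytes = toy.memory_usage(deep=True)
shallow_bytes = toy.memory_usage(deep=False)
print(deep_bytes.sum(), shallow_bytes.sum())
```

On the full dataset, the deep measurement can be noticeably larger than the default one because of the long description columns.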


#### Interpreting the Results

Read the output above, and then answer the following questions:

How many total rows are in this DataFrame?  How many columns contain numeric data? How many contain categorical data?  Identify at least 3 columns that contain missing values. 

Write your answer below this line:

In [9]:
# Check the shape of the DataFrame
print('Lego Data - rows:', lego.shape[0], 'columns:', lego.shape[1])

Lego Data - rows: 12261 columns: 14


In [10]:
# Checking for missing values
print("There are {} missing values in the DataFrame".format(lego.isnull().sum().sum()))


There are 9245 missing values in the DataFrame


In [11]:
# finding which columns have missing values
lego.isnull().sum()

ages                    0
list_price              0
num_reviews          1620
piece_count             0
play_star_rating     1775
prod_desc             377
prod_id                 0
prod_long_desc          0
review_difficulty    2055
set_name                0
star_rating          1620
theme_name              3
val_star_rating      1795
country                 0
dtype: int64
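
Raw counts are easier to judge as a share of all rows. A minimal sketch, assuming a small hypothetical DataFrame in place of the real data: because `True` counts as 1, taking the mean of `.isnull()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the column names mirror the lab's dataset
toy = pd.DataFrame({
    "piece_count": [277.0, 168.0, 74.0, 109.0],
    "num_reviews": [2.0, np.nan, 11.0, np.nan],
    "review_difficulty": ["Average", "Easy", None, "Easy"],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# Fraction of rows that are missing in each column
missing_share = toy.isnull().mean()

# Names of the columns that contain at least one missing value
columns_with_gaps = missing_counts[missing_counts > 0].index.tolist()

print(missing_share)
print(columns_with_gaps)
```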

### Using `.describe()`

Whereas `.info()` provides statistics about the DataFrame itself, `.describe()` returns basic summary statistics about the data contained within the DataFrame.

In the cell below, call the DataFrame's `.describe()` method. 

![desc](https://media.giphy.com/media/26BRLZnuJ6OvdWXa8/giphy.gif)

In [12]:
lego.describe()

| | list_price | num_reviews | piece_count | play_star_rating | prod_id | star_rating | val_star_rating |
|---|---|---|---|---|---|---|---|
| count | 12261.0 | 10641.0 | 12261.0 | 10486.0 | 12261.0 | 10641.0 | 10466.0 |
| mean | 65.141998 | 16.826238 | 493.405921 | 4.337641 | 59836.75 | 4.514134 | 4.22896 |
| std | 91.980429 | 36.368984 | 825.36458 | 0.652051 | 163811.5 | 0.518865 | 0.660282 |
| min | 2.2724 | 1.0 | 1.0 | 1.0 | 630.0 | 1.8 | 1.0 |
| 25% | 19.99 | 2.0 | 97.0 | 4.0 | 21034.0 | 4.3 | 4.0 |
| 50% | 36.5878 | 6.0 | 216.0 | 4.5 | 42069.0 | 4.7 | 4.3 |
| 75% | 70.1922 | 13.0 | 544.0 | 4.8 | 70922.0 | 5.0 | 4.7 |
| max | 1104.87 | 367.0 | 7541.0 | 5.0 | 2000431.0 | 5.0 | 5.0 |
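
By default, `.describe()` on a mixed DataFrame reports only the numeric columns. The sketch below uses hypothetical toy rows to show what the same method returns when it is called on text columns only: a count, the number of unique values, the most frequent value, and how often that value occurs.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative, not taken from the file
toy = pd.DataFrame({
    "theme_name": ["Angry Birds™", "Angry Birds™", "City", "City", "City"],
    "review_difficulty": ["Average", "Easy", "Easy", None, "Easy"],
    "piece_count": [277.0, 168.0, 74.0, 109.0, 90.0],
})

# Selecting only the text columns makes describe() switch to
# count / unique / top / freq instead of mean, std and quartiles
text_summary = toy[["theme_name", "review_difficulty"]].describe()
print(text_summary)
```

Note that `count` excludes missing values here too, so `review_difficulty` reports four values rather than five.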


#### Interpreting the Results

The output contains descriptive statistics corresponding to the columns. Use these to answer the following questions:

What is the standard deviation of `piece_count`?  How many pieces are in the largest Lego set?  How many are in the smallest?  What is the median `val_star_rating`?

________________________________________________________________________________________________________________________________

In [13]:
lego.columns

Index(['ages', 'list_price', 'num_reviews', 'piece_count', 'play_star_rating',
       'prod_desc', 'prod_id', 'prod_long_desc', 'review_difficulty',
       'set_name', 'star_rating', 'theme_name', 'val_star_rating', 'country'],
      dtype='object')

In [17]:
lego.piece_count.std().round(3)

825.365

In [20]:
lego.piece_count.max()

7541.0

In [21]:
lego.piece_count.min()

1.0

In [15]:
lego.val_star_rating.median()

4.3
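
The same answers can also be read from a single `.describe()` result, since it is itself a DataFrame indexed by statistic name. A minimal sketch with hypothetical toy values, using `.loc[row, column]` to pick out individual cells:

```python
import pandas as pd

# Hypothetical toy values for two of the numeric columns
toy = pd.DataFrame({
    "piece_count": [277.0, 168.0, 74.0, 109.0],
    "val_star_rating": [4.0, 4.0, 4.1, 4.5],
})

summary = toy.describe()

# Rows are labelled 'count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'
largest = summary.loc["max", "piece_count"]
smallest = summary.loc["min", "piece_count"]
spread = summary.loc["std", "piece_count"]
median_rating = summary.loc["50%", "val_star_rating"]  # the 50% row is the median

print(largest, smallest, round(spread, 3), median_rating)
```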

## Getting Summary Statistics

Pandas also allows us to easily compute individual summary statistics using built-in methods.  Next, we'll get some practice using these methods. 

In the cell below, compute the median value of the `star_rating` column.

![sum](https://media.giphy.com/media/d5w5tJjb0NaGBaVSS2/giphy.gif)

In [22]:
lego["star_rating"].describe()

count    10641.000000
mean         4.514134
std          0.518865
min          1.800000
25%          4.300000
50%          4.700000
75%          5.000000
max          5.000000
Name: star_rating, dtype: float64

In [23]:
# the median star rating is 4.7; compare it with the 50% row in the summary stats above
lego["star_rating"].median()

4.7

### Next, get a count of the total number of values in `play_star_rating`.

In [27]:
# answer should be 10486
lego["play_star_rating"].count()

10486

In [25]:
lego["play_star_rating"].nunique()

30

### Find the standard deviation for the `list_price` column

In [29]:
lego["list_price"].std().round(3)

91.98

### If we bought every single lego set in this dataset, how many pieces would we have?  

> **Note**: If you truly want to answer this accurately, and are up for the challenge, remove duplicate lego-set entries before summing the pieces. That is, many of the lego sets are listed multiple times in the dataset above, depending on the country where it is being sold and other unique parameters. If you're stuck, just practice calculating the total number of pieces in the dataset for now.

In [30]:
# if you simply want to calculate the sum of the column
lego.piece_count.sum()

6049650.0

### The total above includes many duplicate entries; dropping repeated `prod_id` values gives a much smaller sum

In [31]:
lego.drop_duplicates(subset="prod_id")["piece_count"].sum()

319071.0
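
The de-duplication above assumes that `prod_id` identifies a set and that a set's `piece_count` is the same in every listing. The sketch below, on hypothetical toy rows, shows one way to count the repeated rows and check that assumption before trusting the de-duplicated total.

```python
import pandas as pd

# Hypothetical toy rows: two sets are listed again for a second country
toy = pd.DataFrame({
    "prod_id": [75823.0, 75822.0, 75823.0, 75821.0, 75822.0],
    "country": ["US", "US", "CA", "US", "CA"],
    "piece_count": [277.0, 168.0, 277.0, 74.0, 168.0],
})

# Summing every row counts the repeated sets more than once
naive_total = toy["piece_count"].sum()

# duplicated() flags every repeat of a prod_id after its first appearance
repeat_rows = int(toy.duplicated(subset="prod_id").sum())

# Check the assumption: each prod_id maps to exactly one piece_count
consistent = bool(toy.groupby("prod_id")["piece_count"].nunique().eq(1).all())

# Keep the first listing of each set, then sum
deduped_total = toy.drop_duplicates(subset="prod_id")["piece_count"].sum()

print(naive_total, repeat_rows, consistent, deduped_total)
```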

### Now, let's try getting the value for the 90% quantile for all numerical columns.  Do this in the cell below.

In [32]:
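# numeric_only=True skips the text columns; pandas 2.0+ no longer skips them by default
lego.quantile(.90, numeric_only=True)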

list_price            136.2971
num_reviews            38.0000
piece_count          1077.0000
play_star_rating        5.0000
prod_id             75531.0000
star_rating             5.0000
val_star_rating         5.0000
Name: 0.9, dtype: float64
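
`.quantile()` also accepts a list of quantiles, which returns one row per quantile. A minimal sketch on hypothetical toy data, where `select_dtypes(include="number")` is used as one way to restrict the calculation to numeric columns:

```python
import pandas as pd

# Hypothetical toy rows with one text column mixed in
toy = pd.DataFrame({
    "list_price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "piece_count": [100.0, 200.0, 300.0, 400.0, 500.0],
    "theme_name": ["City", "City", "Friends", "Friends", "Ninjago"],
})

# Keep only the numeric columns so the text column cannot cause an error
numeric = toy.select_dtypes(include="number")

# One row per requested quantile, one column per numeric column
quantiles = numeric.quantile([0.5, 0.9])
print(quantiles)
```

With the default linear interpolation, the 0.9 quantile of five evenly spaced prices falls between the fourth and fifth values.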

## Getting Summary Statistics on Categorical Data

For obvious reasons, most of the methods we've used so far only work with numerical data -- there's no way to calculate the standard deviation of a column containing string values. However, there are some things that we can discover about columns containing categorical data. 

In the cell below, get the `.unique()` values contained within the `review_difficulty` column. 

![unique](https://media.giphy.com/media/MCXFAs15Nxcb4IuXFa/giphy.gif)

In [33]:
lego.review_difficulty.unique()

array(['Average', 'Easy', 'Challenging', 'Very Easy', nan,
       'Very Challenging'], dtype=object)

### Now, let's get the `value_counts()` for the `lego.review_difficulty` column, to see how common each is. 

In [34]:
lego.review_difficulty.value_counts()

Easy                4236
Average             3765
Very Easy           1139
Challenging         1058
Very Challenging       8
Name: review_difficulty, dtype: int64

### Alternatively, pass `normalize=True` to get each value's share of the non-missing entries instead of a raw count

In [35]:
lego.review_difficulty.value_counts(normalize=True)

Easy                0.415050
Average             0.368901
Very Easy           0.111601
Challenging         0.103665
Very Challenging    0.000784
Name: review_difficulty, dtype: float64
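
Both outputs above leave out the missing values that `.unique()` reported as `nan`. A minimal sketch, using a hypothetical toy Series, of how `dropna=False` brings them back into the tally:

```python
import pandas as pd

# Hypothetical toy column with two missing difficulty ratings
difficulty = pd.Series(
    ["Easy", "Average", "Easy", None, "Challenging", None, "Easy"],
    name="review_difficulty",
)

# Default behaviour: missing values are ignored
counts = difficulty.value_counts()

# dropna=False adds a separate entry for the missing values
counts_with_missing = difficulty.value_counts(dropna=False)

# normalize=True divides by the number of non-missing values (5 here, not 7)
shares = difficulty.value_counts(normalize=True)

print(counts_with_missing)
print(shares)
```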

As you can see, these methods give us quick and easy ways to get information on columns containing categorical data.


## Using `.applymap()`

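When working with pandas DataFrames, we can apply a function to every element of the data by using the `.applymap()` method and passing in a function, often a lambda. (In pandas 2.1 and later, `.applymap()` is deprecated in favor of the equivalent `DataFrame.map()` method.)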

For instance, we can use `applymap()` to return a version of the DataFrame where every value has been converted to a string.

In the cell below:

* Call our DataFrame's `.applymap()` method, pass in `lambda x: str(x)`, and store the result in `string_df`
* Call our new `string_df` object's `.info()` method to confirm that everything has been cast to a string

In [37]:
string_df = lego.applymap(lambda x: str(x))

# The `.info()` output below confirms that every column is now of dtype object, i.e. string

In [38]:
string_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   ages               12261 non-null  object
 1   list_price         12261 non-null  object
 2   num_reviews        12261 non-null  object
 3   piece_count        12261 non-null  object
 4   play_star_rating   12261 non-null  object
 5   prod_desc          12261 non-null  object
 6   prod_id            12261 non-null  object
 7   prod_long_desc     12261 non-null  object
 8   review_difficulty  12261 non-null  object
 9   set_name           12261 non-null  object
 10  star_rating        12261 non-null  object
 11  theme_name         12261 non-null  object
 12  val_star_rating    12261 non-null  object
 13  country            12261 non-null  object
dtypes: object(14)
memory usage: 1.3+ MB


Note that everything, even the `NaN` values, has been cast to a string in the example above. This is why every column now reports 12261 non-null values.
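
The sketch below reproduces this effect on a small hypothetical DataFrame. Because `.applymap()` is deprecated from pandas 2.1 onward in favor of `DataFrame.map()`, it picks whichever of the two methods the installed version provides; that fallback is an illustrative convenience, not part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with one missing value
toy = pd.DataFrame({
    "set_name": ["Piggy Plane Attack", "Piggy Car Escape"],
    "piece_count": [168.0, 74.0],
    "num_reviews": [2.0, np.nan],
})

# DataFrame.map is the newer name (pandas 2.1+); fall back to applymap otherwise
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
string_toy = elementwise(str)

# The NaN is now the literal text 'nan', so it no longer counts as missing
missing_before = int(toy.isnull().sum().sum())
missing_after = int(string_toy.isnull().sum().sum())

print(string_toy)
print(missing_before, missing_after)
```

This is worth remembering before converting real data to strings: missing values become indistinguishable from ordinary text unless they are handled first.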

Note that for Pandas Series objects (such as a single column in a DataFrame), we can do the same thing using the `.apply()` method.  
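
A minimal sketch of the Series version, using a hypothetical toy column rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing rating
ratings = pd.Series([4.5, np.nan, 5.0], name="star_rating")

# .apply() calls the function once per element of the Series
as_text = ratings.apply(str)

# Any function of one value works, e.g. a lambda that labels high ratings
is_high = ratings.apply(lambda rating: rating >= 4.7)

print(as_text.tolist())
print(is_high.tolist())
```

Comparisons with `NaN` evaluate to `False`, so the missing rating is labelled as not high rather than staying missing.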

This is just one example of how we can quickly compute custom functions on our DataFrame -- this will become especially useful when we learn how to **_normalize_** our datasets in a later section!

## Summary

In this lab, we learned how to:

* Use the `df.describe()` and `df.info()` summary statistics methods 
* Use built-in Pandas methods for calculating summary statistics 
* Apply a function to every element in a DataFrame

![done](https://media.giphy.com/media/KGYB8ohERAEYbA1OkN/giphy.gif)