[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%204%20Notebooks/GDAN%205400%20-%20Week%204%20Notebooks%20%28VI%29%20-%20Frequencies.ipynb)

This notebook provides recipes for showing variable frequencies in PANDAS DataFrames.

In [None]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)
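
If you only want to change a display option temporarily rather than for the whole notebook, `pd.option_context` is one alternative. The sketch below is illustrative only: `wide_df` is a made-up DataFrame, not part of the insurance data.

```python
import pandas as pd

# Hypothetical wide DataFrame, used only for illustration
wide_df = pd.DataFrame({f"col_{i}": [i] for i in range(30)})

# Remember the current setting so we can compare afterwards
default_max = pd.get_option("display.max_columns")

# Inside the block every column is shown; the setting reverts on exit
with pd.option_context("display.max_columns", None):
    inside_max = pd.get_option("display.max_columns")
    print(wide_df)

after_max = pd.get_option("display.max_columns")
```

Outside the `with` block, the option returns to whatever value it had before.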

# Read in Data

In [None]:
import pandas as pd
import requests

# NOTE: replace `https://github.com/` with `https://raw.githubusercontent.com`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%201/final_insurance_fraud.xlsx'

# Download the file and stop early if the request failed
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

df.head()

In [None]:
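# APPLY DATA CLEANING OPERATIONS FROM CODING ASSIGNMENT 1
# .copy() makes the filtered result an independent DataFrame,
# which avoids a SettingWithCopyWarning on the assignments below
df = df[df['Policy Number'].notnull()].copy()
df['Estimated cost to repair'] = df['Estimated cost to repair'].fillna(0)
df['Estimated cost to replace'] = df['Estimated cost to replace'].fillna(0)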
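
To see what these cleaning steps do, here is a minimal sketch on a tiny made-up table. The values in `toy` are hypothetical. Only the column names are borrowed from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the real dataset
toy = pd.DataFrame({
    "Policy Number": ["A1", None, "C3"],
    "Estimated cost to repair": [1200.0, 500.0, np.nan],
})

# Keep only rows that have a policy number
clean = toy[toy["Policy Number"].notnull()].copy()

# Treat a missing repair estimate as zero
clean["Estimated cost to repair"] = clean["Estimated cost to repair"].fillna(0)

print(clean)
```

The row without a policy number is dropped, and the remaining missing estimate becomes `0`.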

# Column Frequencies in PANDAS

**[ChatGPT prompt]** `How can I get frequencies in PANDAS?`

### Using `value_counts()` in Pandas

The `value_counts()` method is a simple way to calculate the frequency of unique values in a pandas Series or DataFrame column. It returns the counts of unique values in descending order by default.

This is usually the most direct method when you want frequencies for a single column.

---

#### 1. Basic Usage

`value_counts()` counts the occurrences of each unique value in the column.

```python
# Example DataFrame
import pandas as pd

data = {
    'Industry': ['Retail', 'Tech', 'Finance', 'Retail', 'Tech', 'Finance', 'Retail', 'Tech']
}
df = pd.DataFrame(data)
```

Using `value_counts()`
```python
print(df['Industry'].value_counts())
```
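
The result of `value_counts()` is itself a Series, with the unique values as the index and the counts as the values. A short sketch of working with that result, using the same example values (the variable names `industries` and `counts` are just for illustration):

```python
import pandas as pd

industries = pd.Series(
    ['Retail', 'Tech', 'Finance', 'Retail', 'Tech', 'Finance', 'Retail', 'Tech'],
    name='Industry',
)

counts = industries.value_counts()

# Look up the count for a single category by its label
retail_count = counts['Retail']
finance_count = counts['Finance']

# The counts add up to the number of non-missing values
total = counts.sum()

print(counts)
print(retail_count, finance_count, total)
```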


#### 2. Normalize to Get Proportions
You can normalize the counts to get proportions instead of absolute counts by setting `normalize=True`.
```python
print(df['Industry'].value_counts(normalize=True))
```
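
Proportions are often easier to read as percentages. One way to do that, sketched here with the same example values, is to multiply by 100 and round:

```python
import pandas as pd

industries = pd.Series(
    ['Retail', 'Tech', 'Finance', 'Retail', 'Tech', 'Finance', 'Retail', 'Tech'],
    name='Industry',
)

# Proportions sum to 1
proportions = industries.value_counts(normalize=True)

# Convert to percentages rounded to one decimal place
percentages = (proportions * 100).round(1)

print(percentages)
```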


#### 3. Include Missing Values
By default, `value_counts()` excludes `NaN` values. You can include them by setting `dropna=False`.
```python
data_with_nan = {
    'Industry': ['Retail', 'Tech', 'Finance', 'Retail', 'Tech', None, 'Retail', 'Tech']
}

df_with_nan = pd.DataFrame(data_with_nan)

print(df_with_nan['Industry'].value_counts(dropna=False))
```
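
If you would rather see missing values under a readable label, you can fill them before counting. This is a sketch, and the label `'Missing'` is an arbitrary choice rather than anything PANDAS requires:

```python
import pandas as pd

industries = pd.Series(
    ['Retail', 'Tech', 'Finance', 'Retail', 'Tech', None, 'Retail', 'Tech'],
    name='Industry',
)

# Count missing values directly
n_missing = industries.isna().sum()

# Default behavior: missing values are left out of the counts
default_counts = industries.value_counts()

# Replace missing values with an explicit (arbitrary) label, then count
labeled_counts = industries.fillna('Missing').value_counts()

print(n_missing)
print(labeled_counts)
```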

#### 4. Sorting the Output
You can sort the results in ascending order by setting `ascending=True`.
```python
print(df['Industry'].value_counts(ascending=True))
```
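
Note that `ascending` sorts by the counts. To sort by the category labels instead, you can chain `.sort_index()` onto the result, as in this sketch with the same example values:

```python
import pandas as pd

industries = pd.Series(
    ['Retail', 'Tech', 'Finance', 'Retail', 'Tech', 'Finance', 'Retail', 'Tech'],
    name='Industry',
)

# Sorted by count, smallest first
counts_ascending = industries.value_counts(ascending=True)

# Sorted alphabetically by category label
counts_by_label = industries.value_counts().sort_index()

print(counts_ascending)
print(counts_by_label)
```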

In [None]:
# Frequency of occupations
df['Occupation'].value_counts()

In [None]:
# Frequency (normalized)
df['Occupation'].value_counts(normalize=True)

In [None]:
# Frequency including NaN values
df['Occupation'].value_counts(dropna=False)

In [None]:
# Frequency sorting on frequencies – ascending values
df['Occupation'].value_counts(ascending=True)

In [None]:
# Frequency sorting on frequencies – descending values
df['Occupation'].value_counts(ascending=False)

In [None]:
# Frequency sorting on occupation
df['Occupation'].value_counts().sort_index()

# Summary of Options:
- Basic Counts: `df['Column_Name'].value_counts()`
- Normalized Counts: `df['Column_Name'].value_counts(normalize=True)`
- Include NaNs: `df['Column_Name'].value_counts(dropna=False)`
- Sort Ascending: `df['Column_Name'].value_counts(ascending=True)`
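
These options can also be combined. The sketch below builds a small frequency table with counts and percentages side by side. `frequency_table` is a hypothetical helper written for this example, not a PANDAS function.

```python
import pandas as pd


def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Return counts and percentages for each value, including NaN."""
    # Counts, keeping missing values as their own row
    counts = series.value_counts(dropna=False)
    # Percentages computed from the same counts so the rows line up
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({'count': counts, 'percent': percent})


# Example values reused from the sections above
industries = pd.Series(
    ['Retail', 'Tech', 'Finance', 'Retail', 'Tech', None, 'Retail', 'Tech'],
    name='Industry',
)

table = frequency_table(industries)
print(table)
```

With the notebook's data, the same helper could be called as `frequency_table(df['Occupation'])`.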