# pandas exploration
In this assignment you will select a data set and do some munging and analysis of it using `pandas`, Jupyter Notebooks, and associated Python-centric data science tools.

## Set up environment

The following lines ensure that `numpy` and `pandas` are installed in the notebook environment.  Depending on your system, this may not be necessary and may be removed.

In [1]:
!pip install numpy
!pip install pandas


[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: C:\Users\86136\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip







Import the core data science libraries:

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Import the raw data
In this section, you will import the raw data into a `pandas` DataFrame.

In [3]:
# place your code into this Code cell
df = pd.read_csv('data/data.csv')

## Data inspection
In this section, you will show enough of your data for a viewer to get a general sense of how the data is structured and any unique features of it.  Complete each of the indicated tasks in a Code cell, making sure to include a Markdown cell above each Code cell that explains what is being shown by the code.  
- Show 5 rows, selected at random, from the data set.
- Show each of the column names and their data types.
- Show any unique features of your chosen data set.

Feel free to add as many additional cells as you need to help explain the raw data.

1. Show 5 randomly selected rows

In [6]:
df.sample(5)

|     | Year | Leading Cause | Sex | Race Ethnicity | Deaths | Death Rate | Age Adjusted Death Rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 597 | 2013 | Chronic Lower Respiratory Diseases (J40-J47) | F | White Non-Hispanic | 521 | 36.6 | 21.6 |
| 652 | 2012 | Diabetes Mellitus (E10-E14) | F | Black Non-Hispanic | 409 | 39.1 | 33.8 |
| 67 | 2011 | Human Immunodeficiency Virus Disease (HIV: B20... | M | Hispanic | 162 | 14.1 | 16.2 |
| 622 | 2007 | Accidents Except Drug Posioning (V01-X39, X43,... | M | Not Stated/Unknown | 8 | . | . |
| 966 | 2011 | Certain Conditions originating in the Perinata... | M | Asian and Pacific Islander | 19 | 3.6 | 4.2 |
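
Because `sample()` draws different rows on each run, the rows shown will change whenever the notebook is re-executed. A minimal sketch of one way to make the draw repeatable, assuming a small made-up frame `toy` in place of `df` (its values are invented for illustration, and the seed 42 is arbitrary):

```python
import pandas as pd

# Toy stand-in for df; these values are invented, not taken from the data set.
toy = pd.DataFrame({"Year": range(2007, 2015), "Deaths": range(8)})

# A fixed random_state makes sample() return the same rows on every run.
first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)

print(first)
```

Whether a fixed seed is desirable depends on the goal: it helps when the Markdown commentary refers to specific rows, and it can be dropped when any five rows will do.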


2. Show each of the column names and their data types

In [7]:
df.dtypes

Year                        int64
Leading Cause              object
Sex                        object
Race Ethnicity             object
Deaths                     object
Death Rate                 object
Age Adjusted Death Rate    object
dtype: object

3. Show unique features of the data set

a.  Leading Causes of Death Distribution  
This step focuses on understanding the different categories within the 'Leading Cause' column. By counting the occurrences of each cause of death, we can identify the most common and the least common causes recorded in the dataset.

In [8]:
leading_cause_distribution = df['Leading Cause'].value_counts()
leading_cause_distribution

Leading Cause
Influenza (Flu) and Pneumonia (J09-J18)                                                                                              96
Diseases of Heart (I00-I09, I11, I13, I20-I51)                                                                                       96
Malignant Neoplasms (Cancer: C00-C97)                                                                                                96
All Other Causes                                                                                                                     96
Diabetes Mellitus (E10-E14)                                                                                                          92
Cerebrovascular Disease (Stroke: I60-I69)                                                                                            90
Chronic Lower Respiratory Diseases (J40-J47)                                                                                         88
Accidents Except Drug Posioning (V

b.  Gender Distribution  
Here, we're examining the 'Sex' column to see how the records are distributed between different genders. It provides a basic demographic breakdown and helps us understand if there's a balance between genders within the dataset.

In [9]:
sex_distribution = df['Sex'].value_counts()
sex_distribution

Sex
F    554
M    540
Name: count, dtype: int64

c.  Race and Ethnicity Distribution
The 'Race Ethnicity' column categorizes the data by race and ethnicity. Analyzing this column shows the demographic diversity of the dataset and highlights the representation of various racial and ethnic groups.

In [10]:
race_ethnicity_distribution = df['Race Ethnicity'].value_counts()
race_ethnicity_distribution

Race Ethnicity
Not Stated/Unknown            200
Other Race/ Ethnicity         186
Black Non-Hispanic            178
Hispanic                      177
Asian and Pacific Islander    177
White Non-Hispanic            176
Name: count, dtype: int64
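
Raw counts such as those for 'Sex' and 'Race Ethnicity' can be easier to compare as shares of all rows. A minimal sketch, assuming a small made-up frame `toy` in place of `df` (the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for df; these rows are invented for illustration only.
toy = pd.DataFrame({"Sex": ["F", "F", "M", "F", "M"]})

# normalize=True returns each category's share of rows instead of a raw count.
sex_share = toy["Sex"].value_counts(normalize=True)

print(sex_share)
```

The same call should work on any categorical column, for example `df['Race Ethnicity']`.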

d.  Year Distribution
By looking at the 'Year' column, we can evaluate the dataset across different time points. This helps us see if certain years have more records than others, which might be indicative of changes in data collection methods or actual trends in mortality.

In [11]:
year_distribution = df['Year'].value_counts()
year_distribution

Year
2011    141
2007    141
2010    138
2008    136
2014    136
2009    135
2012    134
2013    133
Name: count, dtype: int64
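
`value_counts()` orders its result by frequency, which is why the years above appear out of chronological order. A minimal sketch of reordering by year instead, assuming a small made-up frame `toy` in place of `df`:

```python
import pandas as pd

# Toy stand-in for df; these values are invented for illustration only.
toy = pd.DataFrame({"Year": [2011, 2007, 2011, 2009, 2007, 2011]})

# sort_index() reorders the counts by their index (the year) rather than by frequency.
year_counts = toy["Year"].value_counts().sort_index()

print(year_counts)
```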

e.  Deaths Statistics
In this step, we call `describe()` on the 'Deaths' column. Because the column is stored as `object` (its most frequent value is the placeholder `.`, which appears 138 times), the summary reports only count, unique, top, and freq rather than numeric statistics such as the mean or range.

In [12]:
deaths_statistics = df['Deaths'].describe()
deaths_statistics

count     1094
unique     465
top          .
freq       138
Name: Deaths, dtype: object
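
To get numeric statistics, the column would first need a numeric dtype. A minimal sketch of one common approach, assuming a made-up Series `deaths_raw` in place of `df['Deaths']` (the values are invented, with `.` standing in for a missing count as in the output above):

```python
import pandas as pd

# Toy stand-in for df['Deaths']; values are invented, "." marks a missing count.
deaths_raw = pd.Series(["120", "45", ".", "8", "300"], name="Deaths")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
deaths_numeric = pd.to_numeric(deaths_raw, errors="coerce")

# describe() on a numeric Series reports count, mean, std, min, quartiles, and max.
deaths_summary = deaths_numeric.describe()

print(deaths_summary)
```

Note that `count` in the numeric summary excludes the coerced `NaN` entries, so it will be smaller than the row count whenever placeholders are present.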

## Data munging
Place your **data munging** code and documentation within this section.  
- Keep each of your Code cells short and focused on a single task.  
- Include a Markdown cell above each code cell that describes what task the code within the code cell is performing.
- Make as many code cells as you need to complete the munging - a few have been created for you to start with.
- Display 5 sample rows of the modified data after each transformation so a viewer can see how the data has changed.

**Note**: If you believe that your data set does not require any munging, please explain in detail.  Create Markdown cells that explain your thinking and create Code cells that show any specific structures of the data you refer to in your explanation.
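
For this data set, the `object` dtype reported for 'Deaths', 'Death Rate', and 'Age Adjusted Death Rate' suggests that the `.` placeholder is a natural first munging target. One option is to declare the placeholder as missing at load time. A minimal sketch, assuming a hypothetical CSV string `csv_text` in place of `data/data.csv` (all values are invented for illustration):

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for data/data.csv; all values are invented.
csv_text = """Year,Leading Cause,Deaths,Death Rate
2010,Cause A,120,36.6
2011,Cause B,.,.
2012,Cause C,8,3.6
"""

# na_values tells read_csv to treat "." as missing, so the columns parse as numbers.
toy = pd.read_csv(io.StringIO(csv_text), na_values=".")

print(toy.dtypes)
print(toy.sample(2, random_state=0))
```

Handling the placeholder at load time keeps the later cells free of conversion code; converting after loading is an equally reasonable choice when the raw strings need to be inspected first.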

## Data analysis
Place your **data analysis** code and documentation within this section.
- Perform at least 5 different statistical or other analyses of different aspects of the data.
    - Your analyses must be specific and relevant to your chosen data set and show interesting aspects of it.
    - Include at least one analysis that includes grouping rows by a shared attribute and performing some kind of statistical analysis on each group.
    - Sort the data in at least 1 of your analyses; sorting alone does not constitute an analysis.
- Keep each of your Code cells short and focused on a single task.
- Include a Markdown cell above each Code cell that describes what task the code within the Code cell is performing.
- Make as many code cells as you need to complete the analysis - a few have been created for you to start with.
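
The grouping and sorting requirements above can be combined in a single analysis. A minimal sketch, assuming a small made-up frame `toy` whose 'Deaths' column is already numeric (the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the cleaned data; these values are invented for illustration.
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", "F"],
    "Deaths": [120, 45, 8, 300, 72],
})

# Group rows by a shared attribute, compute per-group statistics,
# then sort the groups by their mean in descending order.
deaths_by_sex = (
    toy.groupby("Sex")["Deaths"]
    .agg(["count", "mean", "sum"])
    .sort_values("mean", ascending=False)
)

print(deaths_by_sex)
```

Any categorical column could serve as the grouping key; 'Sex' is used here only because it keeps the toy example small.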

## Data visualization
In this section, you will create a few **visualizations** that show some of the insights you have gathered from this data.
- Create at least 5 different visualizations, where each visualization shows different insights into the data.
- Use at least 3 different visualization types (e.g. bar charts, line charts, stacked area charts, pie charts, etc.).
- Create a Markdown cell and a Code cell for each, where you explain and show the visualizations, respectively.
- Create as many additional cells as you need to prepare the data for the visualizations.
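
As a starting point, one of the visualizations could be a bar chart built from a grouped total. A minimal sketch, assuming a small made-up frame `toy` with a numeric 'Deaths' column (the values and the chart title are invented for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the cleaned data; these values are invented for illustration.
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011, 2012],
    "Deaths": [120, 45, 8, 300, 72],
})

# Prepare the data: total deaths per year, in chronological order.
deaths_per_year = toy.groupby("Year")["Deaths"].sum().sort_index()

# Draw one bar per year; years are cast to strings so they plot as categories.
fig, ax = plt.subplots()
ax.bar(deaths_per_year.index.astype(str), deaths_per_year.values)
ax.set_xlabel("Year")
ax.set_ylabel("Deaths")
ax.set_title("Total deaths per year (toy data)")
```

With `%matplotlib inline` active, the figure should render directly below the cell; outside a notebook, a call to `plt.show()` would typically be needed.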