## Exploratory Data Analysis in Python

In [None]:
# Print the first five rows of unemployment
print(unemployment.head())

# Count the values associated with each continent in unemployment
print(unemployment["continent"].value_counts())

# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(data=unemployment, x="2021", binwidth=1)

plt.show()

In [None]:
# Define a Series describing whether each continent is outside of Oceania
not_oceania = ~unemployment["continent"].isin(["Oceania"])

# Print the minimum and maximum unemployment rates during 2021
print(unemployment["2021"].min(), unemployment["2021"].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(data=unemployment, x="2021", y="continent")
plt.show()

In [None]:
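# Print the mean and standard deviation of rates by year
# (numeric columns only; recent pandas versions raise an error on text columns)
numeric_cols = unemployment.select_dtypes("number").columns
print(unemployment[numeric_cols].agg(["mean", "std"]))

# Print yearly mean and standard deviation grouped by continent
print(unemployment.groupby("continent")[numeric_cols].agg(["mean", "std"]))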

continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=("2021", "mean"),
    # Create the std_rate_2021 column
    std_rate_2021=("2021", "std")
)
print(continent_summary)

# Create a bar plot of continents and their 2021 average unemployment
sns.barplot(data=unemployment, x="continent", y="2021")
plt.show()

## Chapter 2

In [None]:
# Count the number of missing values in each column
print(planes.isna().sum())

# Find the five percent threshold
threshold = len(planes) * 0.05

# Create a filter
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]

# Drop missing values for columns below the threshold
planes.dropna(subset=cols_to_drop, inplace=True)

print(planes.isna().sum())

In [None]:
# Check the values of the Additional_Info column
print(planes["Additional_Info"].value_counts())

# Create a box plot of Price by Airline
sns.boxplot(data=planes, x="Airline", y="Price")

plt.show()

How should you deal with the missing values in "Additional_Info" and "Price"?
Remove the "Additional_Info" column and impute the median by "Airline" for missing values of "Price".
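
A minimal sketch of that answer on a made-up table. Only the column names come from the exercise; the rows, airline labels, and prices are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the planes DataFrame (all values are made up)
toy_planes = pd.DataFrame({
    "Airline": ["A", "A", "A", "B", "B", "B"],
    "Price": [100.0, 300.0, np.nan, 50.0, np.nan, 70.0],
    "Additional_Info": [None, "No info", None, None, None, "No info"],
})

# Drop the column that carries little usable information
toy_planes = toy_planes.drop(columns="Additional_Info")

# Median price per airline (missing values are ignored by median)
median_by_airline = toy_planes.groupby("Airline")["Price"].median()

# Fill each missing price with the median of its own airline
toy_planes["Price"] = toy_planes["Price"].fillna(
    toy_planes["Airline"].map(median_by_airline)
)

print(toy_planes)
```

Imputing per airline rather than with one overall median keeps a cheap carrier's missing fares from being filled with a value typical of an expensive one.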

In [None]:
# Calculate median plane ticket prices by Airline
airline_prices = planes.groupby("Airline")["Price"].median()

print(airline_prices)

# Convert to a dictionary
prices_dict = airline_prices.to_dict()

# Map the dictionary to the missing values
planes["Price"] = planes["Price"].fillna(planes["Airline"].map(prices_dict))

# Check for missing values
print(planes.isna().sum())

In [None]:
# Filter the DataFrame for object columns
non_numeric = planes.select_dtypes("object")

# Loop through columns
for col in non_numeric.columns:
  
  # Print the number of unique values
  print(f"Number of unique values in {col} column: ", non_numeric[col].nunique())

In [None]:
# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]

# Create short_flights
short_flights = "^0h|^1h|^2h|^3h|^4h"

# Create medium_flights
medium_flights = "^5h|^6h|^7h|^8h|^9h"

# Create long_flights
long_flights = "10h|11h|12h|13h|14h|15h|16h"

In [None]:
import numpy as np

# Create conditions for values in flight_categories to be created
conditions = [
    (planes["Duration"].str.contains(short_flights)),
    (planes["Duration"].str.contains(medium_flights)),
    (planes["Duration"].str.contains(long_flights))
]

# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions, 
                                        flight_categories,
                                        default="Extreme duration")

# Plot the counts of each category
sns.countplot(data=planes, x="Duration_Category")
plt.show()

In [None]:
# Preview the column
print(planes["Duration"].head())

# Remove the string character
planes["Duration"] = planes["Duration"].str.replace("h", "")

# Convert to float data type
planes["Duration"] = planes["Duration"].astype(float)

# Plot a histogram
sns.histplot(data=planes, x="Duration", binwidth=10)
plt.show()

In [None]:
# Price standard deviation by Airline
planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform(lambda x: x.std())

print(planes[["Airline", "airline_price_st_dev"]].value_counts())

# Median Duration by Airline
planes["airline_median_duration"] = planes.groupby("Airline")["Duration"].transform(lambda x: x.median())

print(planes[["Airline","airline_median_duration"]].value_counts())

# Mean Price by Destination
planes["price_destination_mean"] = planes.groupby("Destination")["Price"].transform(lambda x: x.mean())

print(planes[["Destination","price_destination_mean"]].value_counts())

In [None]:
# Plot a histogram of flight prices
sns.histplot(data=planes, x="Price")
plt.show()

# Display descriptive statistics for flight duration
print(planes["Duration"].describe())

In [None]:
# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)

# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth

# Calculate the thresholds
upper = price_seventy_fifth + (1.5 * prices_iqr)
lower = price_twenty_fifth - (1.5 * prices_iqr)

# Subset the data
planes = planes[(planes["Price"] > lower) & (planes["Price"] < upper)]

print(planes["Price"].describe())

## Chapter 3

In [None]:
import pandas as pd

# Import divorce.csv, parsing the appropriate columns as dates in the import
divorce = pd.read_csv("divorce.csv", parse_dates=["divorce_date", "dob_man", "dob_woman", "marriage_date"])
print(divorce.dtypes)

# Convert the marriage_date column to DateTime values
divorce["marriage_date"] = pd.to_datetime(divorce["marriage_date"])

# Define the marriage_year column
divorce["marriage_year"] = divorce["marriage_date"].dt.year

# Create a line plot showing the average number of kids by year
sns.lineplot(data=divorce, x="marriage_year", y="num_kids")
plt.show()

# Create the scatterplot
sns.scatterplot(data=divorce, x="marriage_duration", y="num_kids")
plt.show()

In [None]:
# Seaborn's .pairplot() is useful for understanding the relationships between
# several or all variables in a dataset by aggregating pairwise scatter plots in one visual.
# Create a pairplot for income_woman and marriage_duration
sns.pairplot(data=divorce, vars=["income_woman", "marriage_duration"])
plt.show()

# Create the scatter plot
sns.scatterplot(data=divorce, x="woman_age_marriage", y="income_woman", hue="education_woman")
plt.show()

# Create the KDE plot
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids")
plt.show()

# Update the KDE plot so that marriage duration can't be smoothed too far
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0)
plt.show()

# Update the KDE plot to show a cumulative distribution function
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0, cumulative=True)
plt.show()

## Chapter 4

In [None]:
# Print the relative frequency of Job_Category
print(salaries["Job_Category"].value_counts(normalize=True))

# Cross-tabulate Company_Size and Experience
print(pd.crosstab(salaries["Company_Size"], salaries["Experience"]))

# Cross-tabulate Job_Category and Company_Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"],
            values=salaries["Salary_USD"], aggfunc="mean"))

In [None]:
# Get the month of the response
salaries["month"] = salaries["date_of_response"].dt.month

# Extract the weekday of the response
salaries["weekday"] = salaries["date_of_response"].dt.weekday

# Create a heatmap (numeric columns only; text columns cannot be correlated)
sns.heatmap(salaries.corr(numeric_only=True), annot=True)
plt.show()

# Find the 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)

# Save the median
salaries_median = salaries["Salary_USD"].quantile(0.50)

# Gather the 75th percentile
seventy_fifth = salaries["Salary_USD"].quantile(0.75)
print(twenty_fifth, salaries_median, seventy_fifth)

# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]

# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]

# Create salary_level
salaries["salary_level"] = pd.cut(salaries["Salary_USD"],
                                  bins=salary_ranges,
                                  labels=salary_labels)

# Plot the count of salary levels at companies of different sizes
sns.countplot(data=salaries, x="Company_Size", hue="salary_level")
plt.show()

In [None]:
# Filter for employees in the US or GB
usa_and_gb = salaries[salaries["Employee_Location"].isin(["US", "GB"])]

# Create a barplot of salaries by location
sns.barplot(data=usa_and_gb, x="Employee_Location", y="Salary_USD")
plt.show()

# Create a bar plot of salary versus company size, factoring in employment status
sns.barplot(data=salaries, x="Company_Size", y="Salary_USD", hue="Employment_Status")
plt.show()

## Recap

Your recent learnings
When you left 1 week ago, you worked on Getting to Know a Dataset, the first chapter of the course Exploratory Data Analysis in Python. Here is what you covered in your last lesson:

You learned about the importance of getting to know a dataset through validation and summarization techniques, focusing on both categorical and numerical data. Key points included:

Grouping Data: Using .groupby() to categorize data, which allows for subsequent aggregation functions like .mean() or .count() to summarize data within each category. For instance, grouping books by genre to find the average rating per genre.
Aggregation Functions: Exploring different aggregating functions such as .sum(), .min(), .max(), .var(), and .std() to describe data characteristics. These functions help in understanding the distribution and variability within the data.
Combining .groupby() and .agg(): You saw how to apply multiple aggregation functions to grouped data using .agg(), which can take a list of functions or a dictionary specifying functions per column. This is useful for detailed data exploration.
Named Aggregations: Creating more readable code and output by naming the results of aggregations, making it easier to understand the applied statistics at a glance.
For example, to summarize unemployment rates by continent with mean and standard deviation for 2021, you used:

continent_summary = unemployment.groupby("continent").agg(
    mean_rate_2021=("2021", "mean"),
    std_rate_2021=("2021", "std")
)
Visualizing Data with Seaborn: Learning to use Seaborn for creating bar plots that automatically calculate and display the mean of a quantitative variable across categories, including a 95% confidence interval. This visual representation aids in quickly identifying patterns or outliers in the data.
By applying these techniques, you've gained valuable skills in data summarization and visualization, essential for any data analysis project.
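
The books-by-genre illustration mentioned above can be sketched as follows. The DataFrame, its column names, and the ratings are all made up for this example; only the pattern of `.groupby()` followed by `.agg()` comes from the recap.

```python
import pandas as pd

# Made-up ratings; only the idea of "books grouped by genre" comes from the recap
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Childrens"],
    "rating": [4.0, 5.0, 3.0, 4.0, 4.5],
})

# One aggregation per group
mean_rating = books.groupby("genre")["rating"].mean()

# Several aggregations at once by passing a list to .agg()
rating_stats = books.groupby("genre")["rating"].agg(["mean", "min", "max"])

# Named aggregations: keyword=(column, function) yields readable column names
genre_summary = books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    n_books=("rating", "count"),
)

print(genre_summary)
```

Named aggregations cost a little more typing than a plain list, but the output columns then describe themselves instead of forming a two-level header.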

When you left 9 days ago, you worked on Data Cleaning and Imputation, chapter 2 of the course Exploratory Data Analysis in Python. Here is what you covered in your last lesson:

You learned about converting and analyzing categorical data within a DataFrame, focusing on handling job titles in a dataset. Key points covered include:

Filtering Non-Numeric Data: You discovered how to use select_dtypes to isolate non-numeric columns, such as job titles, from a DataFrame. This method helps in focusing on categorical data for analysis.
Analyzing Value Frequencies: The nunique method was used to count unique job titles, revealing there were 50 different titles, with "Research Scientist" being a less common role.
String Manipulation for Data Filtering: You learned to use the str.contains method to filter rows based on whether they contain certain keywords, such as "Scientist". This is useful for narrowing down data to specific categories of interest.
Combining Filters: By using the pipe (|) symbol in str.contains, you combined filters to find job titles containing either "Machine Learning" or "AI", showcasing how to search for multiple conditions within a single column.
Creating New Categorical Columns: You created a new column, Job_Category, by defining a list of job roles and using NumPy's select function to categorize each job title based on predefined conditions. This allows for a more organized analysis of job categories.
Here's a snippet of code you worked with:

# Create conditions for values in flight_categories to be created
conditions = [
    (planes["Duration"].str.contains(short_flights)),
    (planes["Duration"].str.contains(medium_flights)),
    (planes["Duration"].str.contains(long_flights))
]

# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions,
                                        flight_categories,
                                        default="Extreme duration")
This lesson equipped you with techniques to clean and categorize textual data, making it easier to analyze and visualize.
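
The job-title steps described above could look roughly like this. The DataFrame, the `Designation` column name, the titles, and the category labels are assumptions made for illustration. The sketch uses `select_dtypes(exclude="number")` as one way to isolate non-numeric columns.

```python
import numpy as np
import pandas as pd

# Made-up job titles; column names and values are illustrative only
jobs = pd.DataFrame({
    "Designation": [
        "Data Scientist",
        "Research Scientist",
        "Machine Learning Engineer",
        "AI Developer",
        "Data Analyst",
    ],
    "Salary_USD": [100, 120, 130, 125, 80],
})

# Isolate non-numeric columns and count the unique titles
non_numeric = jobs.select_dtypes(exclude="number")
n_titles = non_numeric["Designation"].nunique()

# The pipe means "or" inside the search pattern
scientist = "Scientist"
ml_or_ai = "Machine Learning|AI"

conditions = [
    jobs["Designation"].str.contains(scientist),
    jobs["Designation"].str.contains(ml_or_ai),
]
categories = ["Science", "Machine Learning / AI"]

# Rows that match no condition receive the default label
jobs["Job_Category"] = np.select(conditions, categories, default="Other")

print(jobs[["Designation", "Job_Category"]])
```

`np.select` checks the conditions in order, so a title that matches more than one pattern is assigned the first matching category.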

The goal of the next lesson is to introduce techniques for visualizing numeric data, enabling you to effectively communicate your findings and insights from your analysis.

When you left 12 days ago, you worked on Data Cleaning and Imputation, chapter 2 of the course Exploratory Data Analysis in Python. Here is what you covered in your last lesson:

You learned about handling numeric data in pandas, focusing on cleaning and transforming data for analysis. Key points included:

Removing commas and changing data types: You saw how to clean numeric data that was stored as text by removing commas from the Salary_In_Rupees column using Series.str.replace() and then converting the column to a float data type for further analysis.
Currency conversion: You learned to create a new column, Salary_USD, by converting Salary_In_Rupees to USD using a conversion rate, demonstrating the process of manipulating and adding new data to a DataFrame.
Calculating summary statistics: The lesson covered how to use pandas' groupby function and transform method to calculate and add summary statistics like mean and standard deviation to your DataFrame based on specific conditions, such as experience level.
Handling string values in numeric columns: You tackled cleaning a Duration column in a planes DataFrame, which involved converting string values to a numeric data type to enable analysis.
For example, to remove commas and convert a column to float, you used:

df['Salary_In_Rupees'] = df['Salary_In_Rupees'].str.replace(',', '').astype(float)
And to calculate and add a new column for standard deviation of salaries based on experience:

df['std_dev_salary'] = df.groupby('Experience')['Salary_USD'].transform(lambda x: x.std())
This lesson equipped you with practical skills for cleaning and preparing numeric data, setting a strong foundation for exploratory data analysis and data science projects.
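
Put together on a small made-up table, the cleaning steps above might look like the sketch below. The salary values are invented, and the exchange rate is a hypothetical constant chosen only to keep the arithmetic simple.

```python
import pandas as pd

# Made-up salaries stored as text with thousands separators
df = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Senior", "Senior"],
    "Salary_In_Rupees": ["800,000", "1,200,000", "4,000,000", "6,000,000"],
})

# Strip the commas, then convert the text to float
df["Salary_In_Rupees"] = (
    df["Salary_In_Rupees"].str.replace(",", "", regex=False).astype(float)
)

# Hypothetical exchange rate (an assumption, not a real quote)
RUPEES_TO_USD = 0.0125
df["Salary_USD"] = df["Salary_In_Rupees"] * RUPEES_TO_USD

# transform() returns one value per row, so the result aligns with df
df["mean_salary"] = df.groupby("Experience")["Salary_USD"].transform("mean")
df["std_dev_salary"] = df.groupby("Experience")["Salary_USD"].transform("std")

print(df)
```

Passing the function name as a string to `.transform()` is equivalent to the `lambda x: x.std()` form used in the lesson and is slightly shorter.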

The goal of the next lesson is to learn how to handle outliers in datasets by identifying, analyzing, and deciding the best approach to deal with them to ensure accurate data analysis.

When you left 6 days ago, you worked on Relationships in Data, chapter 3 of the course Exploratory Data Analysis in Python. Here is what you covered in your last lesson:

You learned about handling outliers, which are observations significantly different from other data points. For instance, in a dataset of house prices, a house priced at five million dollars could be an outlier if the median is $400,000, unless factors like location and size justify the price. Key points covered include:

Understanding Outliers: Recognizing that outliers can skew data analyses and may not accurately represent the dataset.
Identifying Outliers: Using the interquartile range (IQR) to mathematically define outliers. The IQR is the difference between the 75th and 25th percentiles, and outliers are typically any values more than 1.5 times the IQR above the 75th percentile or more than 1.5 times the IQR below the 25th percentile.
Calculating IQR and Outlier Thresholds:
IQR = Series.quantile(0.75) - Series.quantile(0.25)
upper_limit = 75th_percentile + 1.5 * IQR
lower_limit = 25th_percentile - 1.5 * IQR
Decision Making on Outliers: Deciding whether to keep, adjust, or remove outliers based on their relevance and accuracy.
Impact of Removing Outliers: Observing how outlier removal can lead to a more normally distributed dataset, which is crucial for many statistical tests and machine learning models.
You practiced identifying outliers using visualizations and learned techniques to remove them, thereby preparing your dataset for further analysis.
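
As a small worked example of the IQR rule above, the following sketch flags outliers in a made-up Series of prices. The numbers are invented so that exactly one value falls outside the thresholds.

```python
import pandas as pd

# Made-up prices with one obviously extreme value
prices = pd.Series([10, 12, 14, 16, 18, 20, 22, 100], name="Price")

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Thresholds: 1.5 * IQR beyond each quartile
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the thresholds are treated as outliers
outliers = prices[(prices < lower) | (prices > upper)]
no_outliers = prices[(prices >= lower) & (prices <= upper)]

print(lower, upper)
print(outliers)
```

Whether to drop the flagged rows is still a judgment call. The rule only identifies candidates for a closer look.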

The goal of the next lesson is to learn how to enhance the quality of datasets by effectively handling missing data, ensuring more reliable and accurate data analysis.

When you left 2 days ago, you worked on Turning Exploratory Analysis into Action, chapter 4 of the course Exploratory Data Analysis in Python. Here is what you covered in your last lesson:

You learned about the relationships between various types of variables in datasets, focusing on categorical variables and their interactions with numerical ones. Specifically, you explored:

The concept of categorical variables, using the education_man variable to understand the distribution of education levels among men in a dataset. Categorical variables, unlike numerical ones, are best summarized and explored through visualizations rather than numerical summaries.
How to visualize the relationship between two variables using histograms and Seaborn's Kernel Density Estimate (KDE) plots. For instance, you examined the relationship between marriage duration and male education level, learning that KDE plots provide a clearer view of distribution peaks across different categories compared to histograms.
The importance of adjusting KDE plot parameters, such as the cut keyword, to avoid misleading representations in data visualization. You saw how setting cut=0 can limit the curve to realistic data ranges, eliminating impossible values like negative marriage durations.
Integrating categorical data into scatter plots to analyze relationships between numerical variables and categories. You created a scatter plot to investigate the correlation between the age at marriage and education level, using the hue argument to differentiate data points by education level.
The practical application of these concepts through exercises, including creating a scatter plot to explore the relationship between women's age at marriage, their income, and education level. The code snippet provided was:
# Create the scatter plot
sns.scatterplot(data=divorce, x="woman_age_marriage", y="income_woman", hue="education_woman")
plt.show()
Lastly, you delved into using KDE plots for comparing distributions across different categories, such as the number of kids in a marriage, to understand how certain factors might influence the duration of a marriage.
This lesson equipped you with tools to visualize and analyze the complex relationships between different types of variables in a dataset, enhancing your data exploration and interpretation skills.

The goal of the next lesson is to teach how to import, convert, manipulate, and visualize DateTime data in pandas for effective time-series analysis.