![Alt text](image/banner.webp)

### Welcome to the Code:You
#### Module 2 Knowledge Check.

In the cell below, we will import the necessary packages to ensure you have everything you need to complete this assignment. If these packages are not already installed, they will be installed automatically.

In [None]:
import subprocess
import sys

def install_and_import(package):
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"{package} not found. Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        __import__(package)
        print(f"{package} has been installed.")

install_and_import('pandas')
install_and_import('matplotlib')

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker


The following cell imports `data_creation` from `api_call.py` and runs it to retrieve data from the Louisville Open Data API. The data is then saved as a CSV file in the data folder. This may take a couple of minutes to run.

In [None]:
from api_call import data_creation

data_creation()

#### Question 1
Read the data into a DataFrame named `df`.
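
As a minimal sketch of how pandas reads CSV data, the example below parses a small in-memory CSV with `pd.read_csv`. The column values are invented for illustration, and `io.StringIO` stands in for a real file; in the notebook you would pass the path of the CSV that `data_creation()` saved in the data folder instead.

```python
import io

import pandas as pd

# Toy CSV text standing in for the real file (all values are made up).
csv_text = """CalYear,Department,YTD_Total
2023,Parks,1000.0
2024,Parks,1200.5
2024,Library,800.0
"""

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```

The first line of the CSV becomes the column names, and pandas infers a type for each column from its values.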

#### Question 2
Display the first 5 rows of data. 

#### Question 3
Display the last 5 rows of data. 

#### Question 4
Find the shape of the DataFrame.

#### Question 5
Use the `describe` function to gain some insight into the data.
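
Questions 2 through 5 rely on four standard DataFrame inspection tools. The sketch below shows them on a toy DataFrame whose values are invented for illustration; the name `toy_df` is hypothetical and is not the assignment's `df`.

```python
import pandas as pd

# Toy data (made up) with seven rows so head() and tail() differ.
toy_df = pd.DataFrame({
    "CalYear": [2022, 2022, 2023, 2023, 2024, 2024, 2024],
    "YTD_Total": [100.0, 250.0, 0.0, 400.0, 150.0, 300.0, 50.0],
})

first_rows = toy_df.head()      # first 5 rows by default
last_rows = toy_df.tail()       # last 5 rows by default
n_rows, n_cols = toy_df.shape   # shape is an attribute, not a method
summary = toy_df.describe()     # count, mean, std, min, quartiles, max

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
print(summary)
```

Both `head` and `tail` take an optional row count, so `toy_df.head(3)` would return only the first three rows.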

#### Question 6
Find the unique values in the `CalYear` column.
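
For reference, here is a small sketch of how `Series.unique` behaves, using a toy Series with made-up years rather than the real data.

```python
import pandas as pd

# Toy column (made up) with repeated values.
years = pd.Series([2022, 2023, 2023, 2024, 2022], name="CalYear")

# unique() returns the distinct values in order of first appearance.
unique_years = years.unique()
print(unique_years)
```

The result is a NumPy array rather than a Series, and it is not sorted.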

#### Question 7
Filter the DataFrame to include only rows where the `CalYear` is `2024`, and assign the result to a variable named `df_2024`.
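
Filtering like this is usually done with a boolean mask. The sketch below shows the pattern on a toy DataFrame; the data is invented and the names `toy_df`, `mask`, and `toy_2024` are hypothetical.

```python
import pandas as pd

# Toy data (made up) covering several years.
toy_df = pd.DataFrame({
    "CalYear": [2023, 2024, 2024, 2022],
    "Department": ["Parks", "Parks", "Library", "Zoo"],
})

# Comparing a column to a value yields a True/False Series (the mask).
mask = toy_df["CalYear"] == 2024

# Indexing with the mask keeps only the True rows; copy() makes the
# result an independent DataFrame that is safe to modify later.
toy_2024 = toy_df[mask].copy()
print(toy_2024)
```

This assumes the year column holds integers. If it were stored as strings, comparing against the integer `2024` would match no rows, and the comparison would need `"2024"` instead.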

The cell below checks whether you filtered the DataFrame correctly.

In [None]:
try:
    if 'df_2024' in globals():
        if not df_2024[df_2024['CalYear'] != 2024].empty:
            years = df_2024['CalYear'].unique()
            print(f"Your dataframe still has {years} in the CalYear column.")
        else:
            print("Congratulations, you filtered the dataframe correctly!")
    else:
        print("The dataframe 'df_2024' is not defined. Did you name the dataframe correctly?")
        
except NameError:
    print("Did you name the dataframe correctly?")

#### Question 8
Drop the columns in the `cols_to_drop` list. 

In [None]:
cols_to_drop = ['Annual_Rate', 'Overtime_Rate', 'Incentive_Allowance', 'Other', 'ObjectId']

# Your code here
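
As a hedged illustration of dropping columns, the sketch below uses `DataFrame.drop` on a toy DataFrame. The data and the names `toy_df`, `toy_cols_to_drop`, and `trimmed` are invented for this example.

```python
import pandas as pd

# Toy data (made up) with two columns we do not need.
toy_df = pd.DataFrame({
    "Department": ["Parks", "Library"],
    "YTD_Total": [1200.5, 800.0],
    "Other": [0.0, 10.0],
    "ObjectId": [1, 2],
})
toy_cols_to_drop = ["Other", "ObjectId"]

# drop() returns a new DataFrame; assign the result to keep the change.
trimmed = toy_df.drop(columns=toy_cols_to_drop)
print(trimmed.columns.tolist())
```

Because `drop` does not modify the original by default, `toy_df` still has all four columns afterwards. Assigning the result back to the same name is a common way to make the change stick.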

The cell below checks whether the columns were dropped.

In [None]:
try:
    if 'df_2024' in globals():
        cols_to_drop = ['Annual_Rate', 'Overtime_Rate', 'Incentive_Allowance', 'Other', 'ObjectId']
        remaining_cols = [col for col in cols_to_drop if col in df_2024.columns]
        if remaining_cols:
            print(f"The following columns were not dropped: {remaining_cols}")
        else:
            print("Congratulations, all specified columns were successfully dropped!")
    else:
        print("The dataframe 'df_2024' is not defined. Did you name the dataframe correctly?")
        
except NameError:
    print("Did you name the dataframe correctly?")


#### Question 9
Remove rows from `df_2024` where the `YTD_Total` column contains either null values or zeros.
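
One way to approach this kind of cleanup is in two steps: drop the missing values, then filter out the zeros. The sketch below shows that pattern on a toy DataFrame with invented values; `toy_df` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Toy data (made up) with one missing value and one zero.
toy_df = pd.DataFrame({
    "Department": ["Parks", "Library", "Zoo", "Fire"],
    "YTD_Total": [1200.5, float("nan"), 0.0, 950.0],
})

# Step 1: drop rows whose YTD_Total is missing.
cleaned = toy_df.dropna(subset=["YTD_Total"])

# Step 2: keep only rows whose YTD_Total is not zero.
cleaned = cleaned[cleaned["YTD_Total"] != 0]
print(cleaned)
```

The `subset` argument limits the null check to the named column, so rows with missing values elsewhere are left alone.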

The cell below checks whether you removed rows with NaN or 0 values in `YTD_Total`.

In [None]:
try:
    if 'df_2024' in globals():
        if df_2024['YTD_Total'].isna().any():
            print("There are still NaN values in the 'YTD_Total' column.")
        elif (df_2024['YTD_Total'] == 0).any():
            print("There are still rows with 'YTD_Total' equal to 0.")
        else:
            print("Congratulations, you've successfully removed rows with NaN or 0 values in 'YTD_Total'!")
    else:
        print("The dataframe 'df_2024' is not defined. Did you name the dataframe correctly?")
        
except NameError:
    print("Did you name the dataframe correctly?")


---
If all of the steps above were done correctly, the code below will run and plot annual salary versus YTD salary spend for the five departments with the highest total `Regular_Rate`.

In [None]:
grouped_df = df_2024.groupby('Department')[['Regular_Rate', 'YTD_Total']].sum()
grouped_df = grouped_df.sort_values(by='Regular_Rate', ascending=False).head(5)

fig, ax = plt.subplots(figsize=(14, 6))

grouped_df.plot(kind='line', marker='o', ax=ax, color=['grey', 'red'], legend=True)

ax.set_title('Annual Salary vs YTD Total by Department')
ax.set_xlabel('Department')
ax.set_ylabel('Value')

def millions(x, pos):
    'The two args are the value and tick position'
    return '%1.1fM' % (x * 1e-6)

ax.yaxis.set_major_formatter(ticker.FuncFormatter(millions))

ax.yaxis.set_major_locator(ticker.MaxNLocator(integer=True))

# Label each tick with its department name, rotated for readability
ax.set_xticks(range(len(grouped_df)))
ax.set_xticklabels(grouped_df.index, rotation=45, ha='right')

ax.set_xlim([-0.5, len(grouped_df) - 0.5])

y_min, y_max = grouped_df.min().min() * 0.9, grouped_df.max().max() * 1.1
ax.set_ylim([y_min, y_max])

handles, labels = ax.get_legend_handles_labels()
ax.legend([handles[0], handles[1]], ['Salary', 'YTD Total'])

plt.grid(True)
plt.tight_layout()
plt.show()