<a href="https://colab.research.google.com/github/dareoyeleke/python_scripting/blob/main/Pandas2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install datasets
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt

# Load data for use
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Little bit of data clean up
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

In [None]:
# We have a lot of NaN values in our DataFrame. To focus on complete values only, we can drop the rows that contain NaN

df.iloc[0:10] # .iloc selects rows and columns strictly by integer position, here the first 10 rows

df.loc[0:, 'salary_rate':'salary_hour_avg'].dropna(subset=['salary_rate'])
# .loc[], on the other hand, selects by label, so it allows filtering with column names, conditions, boolean masks, etc. The .dropna(subset=['salary_rate']) call drops the rows where the salary_rate column is NaN
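
The difference between position-based and label-based selection is easier to see on a small table. The sketch below uses a made-up toy DataFrame (the values are illustrative, not taken from the dataset) and reuses the column names from the cell above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the job postings; the values are made up for illustration
toy = pd.DataFrame(
    {
        "job_title": ["Data Analyst", "Data Engineer", "Data Scientist", "Data Analyst"],
        "salary_rate": ["year", None, "hour", None],
        "salary_year_avg": [90000.0, np.nan, np.nan, np.nan],
        "salary_hour_avg": [np.nan, np.nan, 45.0, np.nan],
    }
)

# .iloc selects by integer position: rows 0-1 and columns 0-1 (end positions are excluded)
by_position = toy.iloc[0:2, 0:2]

# .loc selects by label: a column slice includes both endpoints
by_label = toy.loc[:, "salary_rate":"salary_hour_avg"]

# Drop only the rows where salary_rate is missing; NaN in the other columns is kept
with_rate = by_label.dropna(subset=["salary_rate"])

print(by_position.shape)
print(list(by_label.columns))
print(with_rate.index.tolist())
```

Note that `subset` only controls which columns are checked for missing values; the rows that survive can still contain NaN elsewhere.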

In [None]:
# Here we fill the NaN values for yearly and hourly salaries with the median, and later remove duplicate rows from the DataFrame. First we calculate the median values for the yearly and hourly salaries
median_yearly = df['salary_year_avg'].median()

median_hourly = df['salary_hour_avg'].median()

df_filled = df.copy() # .copy() creates an independent copy so the original is not tampered with; plain assignment (df_filled = df) would only give the same DataFrame a second name

df_filled['salary_year_avg'].fillna(median_yearly) # returns a new Series with the NaN values in salary_year_avg replaced by the median, but does not change df_filled

df_filled['salary_hour_avg'].fillna(median_hourly) # the same for the hourly NaN values

# Neither column is actually filled until the returned Series is assigned back to the column, as below
df_filled['salary_year_avg'] = df_filled['salary_year_avg'].fillna(median_yearly)

df_filled['salary_hour_avg'] = df_filled['salary_hour_avg'].fillna(median_hourly)

df_filled # now we display the copy and can see the NaN values have been filled
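
Two details in this cell are easy to get wrong: plain assignment does not copy a DataFrame, and `.fillna()` returns a new Series instead of changing the column. A minimal check on a made-up three-row DataFrame (the variable names `original`, `alias` and `independent` are only for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing salary
original = pd.DataFrame({"salary_year_avg": [100000.0, np.nan, 120000.0]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

# .median() skips NaN by default, so this is the median of the two known values
median_value = original["salary_year_avg"].median()

# Without assignment, the filled Series is discarded and the column is unchanged
independent["salary_year_avg"].fillna(median_value)
still_missing = int(independent["salary_year_avg"].isna().sum())

# Assigning the result back is what actually fills the column
independent["salary_year_avg"] = independent["salary_year_avg"].fillna(median_value)

print(alias is original)
print(still_missing)
print(independent["salary_year_avg"].tolist())
print(int(original["salary_year_avg"].isna().sum()))
```

Because `independent` was created with `.copy()`, the original still has its missing value after the fill.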

In [None]:
# Now to eliminate the duplicate values in the DataFrame. First we create another copy to make it easier to backtrack in case of any errors
df_dist = df_filled.copy()

df_dist.drop_duplicates() # returns a new DataFrame without the duplicate rows, so we still have to store the result in the original variable
df_dist = df_dist.drop_duplicates()

# to view the difference before and after dropping the duplicates
print("The number of rows before dropping duplicates was", len(df_filled))
print("The number of rows after dropping duplicates is now", len(df_dist))
print("The difference in the number of rows before and after dropping duplicates is", len(df_filled) - len(df_dist)) # the output below shows the number of rows before and after .drop_duplicates(), and the difference between them

The number of rows before dropping duplicates was 785741
The number of rows after dropping duplicates is now 785640
The difference in the number of rows before and after dropping duplicates is 101


In [None]:
# Next we go a step further and drop rows that share the same job title and company name. First we create another copy in case of errors
df_dist2 = df_dist.copy()
df_dist2 = df_dist2.drop_duplicates(subset=['job_title', 'company_name'])
# here we pass a list as the subset argument of drop_duplicates(), so only those columns are compared: rows with the same job_title and company_name count as duplicates even if their other columns differ, and only the first occurrence is kept

print("The number of rows after dropping exact duplicates was", len(df_dist))
print("After also dropping duplicates on job title and company name, the number of rows is", len(df_dist2))
print("The difference in rows between the first and second rounds of duplicate removal is", len(df_dist) - len(df_dist2)) # the output below shows the row counts before and after filtering on the extra columns, and the difference between them

The number of rows after dropping exact duplicates was 785640
After also dropping duplicates on job title and company name, the number of rows is 508042
The difference in rows between the first and second rounds of duplicate removal is 277598
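
The large gap between the two counts comes from how `subset` changes what counts as a duplicate. A small sketch on made-up rows (the `job_location` values are illustrative only) shows the two behaviours side by side:

```python
import pandas as pd

# Made-up postings: rows 0 and 1 are identical, row 2 differs only in location
toy = pd.DataFrame(
    {
        "job_title": ["Data Analyst", "Data Analyst", "Data Analyst", "Data Engineer"],
        "company_name": ["Acme", "Acme", "Acme", "Acme"],
        "job_location": ["Lagos", "Lagos", "Remote", "Lagos"],
    }
)

# Without subset, a row is a duplicate only if every column matches an earlier row
exact = toy.drop_duplicates()

# With subset, only the listed columns are compared; the first occurrence is kept
by_subset = exact.drop_duplicates(subset=["job_title", "company_name"])

print(len(toy), len(exact), len(by_subset))
print(by_subset.index.tolist())
```

Dropping on a subset also discards rows that differ in other columns, such as the remote posting here, so it is worth checking that this is the intended definition of a duplicate.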


In [None]:
'''
  DataFrame management. Here we will be using .sample(). The .sample() method is effective for getting a representative portion of a DataFrame,
which it does better than the .head() and .tail() methods, which return the first and last 5 rows by default. The .sample() method returns random rows from a DataFrame,
with one row being the default; a specific number of rows (n) or a fraction of the total (frac) can be passed as an argument
'''
df.sample() # returns only one random row
df.sample(10, random_state=42) # returns 10 random rows; random_state seeds the random number generator, so the same 10 rows come back every time the cell runs. To keep the sample, assign the result to a variable
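
The options of `.sample()` can be tried out on a small stand-in table. This sketch uses a made-up 100-row DataFrame with a hypothetical `posting_id` column rather than the real dataset:

```python
import pandas as pd

# Made-up stand-in for the job postings: 100 rows with a hypothetical id column
toy = pd.DataFrame({"posting_id": range(100)})

one_row = toy.sample()                                 # default: a single random row
ten_rows = toy.sample(n=10, random_state=42)           # a fixed number of rows
ten_rows_again = toy.sample(n=10, random_state=42)     # same seed, same rows
five_percent = toy.sample(frac=0.05, random_state=42)  # a fraction of all rows

print(len(one_row), len(ten_rows), len(five_percent))
print(ten_rows.equals(ten_rows_again))
```

Only one of `n` and `frac` can be given in a single call.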


In [None]:
# Here we will be creating pivot tables to view the median salary by job title and country for the 6 countries with the most job postings, using the index and aggfunc arguments of the .pivot_table() method
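
One possible shape for such a pivot table is sketched below on made-up data. The column names `job_country` and `job_title_short`, the use of the `columns` argument, and the cutoff `top_n = 2` (instead of 6, to suit the tiny table) are assumptions made for illustration, not the notebook's actual code.

```python
import pandas as pd

# Made-up postings; column names are assumed, values are illustrative
toy = pd.DataFrame(
    {
        "job_country": ["US", "US", "US", "UK", "UK", "NG"],
        "job_title_short": [
            "Data Analyst", "Data Analyst", "Data Engineer",
            "Data Analyst", "Data Engineer", "Data Analyst",
        ],
        "salary_year_avg": [90000.0, 110000.0, 130000.0, 70000.0, 80000.0, 50000.0],
    }
)

# Countries with the most postings (2 here; the text above uses 6)
top_n = 2
top_countries = toy["job_country"].value_counts().head(top_n).index

# Median yearly salary per country (rows) and job title (columns)
pivot = toy[toy["job_country"].isin(top_countries)].pivot_table(
    values="salary_year_avg",
    index="job_country",
    columns="job_title_short",
    aggfunc="median",
)

print(pivot)
```

Filtering to the top countries before pivoting keeps the table small; passing `aggfunc="median"` replaces the default mean aggregation.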
