
### Pandas Data Management

Load data.


In [1]:
# Importing Libraries
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt  

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

# DataFrame Copy
df_original = df.copy()


#### Copy

Recall from the last lesson that we filled in the missing salary values with the median salary.

This time, let's create a new DataFrame, df_altered, and make changes only to it.


In [2]:
# Create new dataframe
df_altered = df_original

df_altered.loc[:5,'salary_year_avg']

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: salary_year_avg, dtype: float64

Let's fill in missing values with the median value.

In [3]:
# Calculating the median salary
median_salary = df_altered['salary_year_avg'].median()

# Filling the missing values with the median salary
df_altered['salary_year_avg'] = df_altered.loc[:,'salary_year_avg'].fillna(median_salary)

In [4]:
df_altered['salary_year_avg'] = df_altered['salary_year_avg'].fillna(median_salary)

Now let's inspect the altered DataFrame.

In [5]:
df_altered.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64



That worked as expected.

But what about the original DataFrame?


In [6]:
df_original.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64



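Hold up! How did df_original get altered?

The assignment df_altered = df_original does not copy any data. It only gives the same DataFrame a second name, so both variables reference the same object, and a change made through one name is visible through the other.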


In [7]:
print('ID of df_original:               ', id(df_original))
print('ID of df_altered:                ', id(df_altered))
print('Are the two dataframes the same? ', id(df_original) == id(df_altered))

ID of df_original:                1841458942224
ID of df_altered:                 1841458942224
Are the two dataframes the same?  True
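
The same behavior can be reproduced on a tiny DataFrame. The sketch below is only an illustration: toy and alias are hypothetical names that are not part of the job dataset. It uses Python's is operator, which checks object identity in the same way as comparing id() values.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'salary': [100.0, None, 300.0]})

# Plain assignment binds a second name to the same object; no data is copied
alias = toy

# Filling missing values through one name...
alias['salary'] = alias['salary'].fillna(0)

# ...also changes what the other name sees, because it is the same object
same_object = alias is toy
missing_in_toy = int(toy['salary'].isna().sum())

print('Same object?           ', same_object)
print('Missing values in toy: ', missing_in_toy)
```

Since alias and toy are one object, there is no "original" left to compare against after the fill.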


Instead, we can use the .copy() method, which creates a new, independent DataFrame.

In [8]:
df_original = df.copy()
df_altered = df_original.copy()

print('ID of df_original:               ', id(df_original))
print('ID of df_altered:                ', id(df_altered))
print('Are the two dataframes the same? ', id(df_original) == id(df_altered))

ID of df_original:                1841979757584
ID of df_altered:                 1841979758224
Are the two dataframes the same?  False


Now when we do this same operation:

In [9]:
# Calculating the median salary
median_salary = df_altered['salary_year_avg'].median()

# Filling the missing values with the median salary
df_altered['salary_year_avg'] = df_altered['salary_year_avg'].fillna(median_salary)

df_altered.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64

The original dataframe doesn't get altered!

In [10]:
df_original.loc[:5,'salary_year_avg']

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: salary_year_avg, dtype: float64
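
As a compact recap, here is a minimal sketch of the same idea on a hypothetical three-row DataFrame (toy and independent are illustrative names, not part of the job dataset). It assumes the default behavior of .copy(), which is a deep copy.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'salary': [100.0, None, 300.0]})

# .copy() makes a deep copy by default, so the new DataFrame owns its own data
independent = toy.copy()

# Fill the missing value in the copy with the copy's median (200.0 here)
median_salary = independent['salary'].median()
independent['salary'] = independent['salary'].fillna(median_salary)

print(independent['salary'].tolist())   # the copy has no missing values
print(toy['salary'].isna().sum())       # the original still has one NaN
```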

Now that we've created a copy of our data, we can start our analysis. With a large dataset, though, it is often more manageable to work with a subset. We can use sample() to get a random sample of the data.


### Sample

#### Notes

   - sample(): Returns a random sample of rows from the DataFrame.

#### Examples

Let's get a random sample of the data. You can request a fixed number of rows with n.


In [11]:
df.sample(n=5)

,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
522561,Data Scientist,Data Research Science Masters Apprenticeship (...,"West End, Southampton, UK",via LinkedIn,Full-time,False,United Kingdom,2023-11-30 11:15:07,False,False,United Kingdom,,,,BBC,"['python', 'express']","{'programming': ['python'], 'webframeworks': [..."
582741,Business Analyst,QA Engineer,"Shanghai, China",via Trabajo.org,Full-time,False,China,2023-04-18 09:49:27,False,False,China,,,,Wish,"['python', 'java', 'golang']","{'programming': ['python', 'java', 'golang']}"
260401,Data Scientist,Sr Data Scientist,"San Antonio, TX",via BeBee,Full-time,False,"Texas, United States",2023-02-03 07:06:45,False,False,United States,,,,"H-E-B, L.P.","['python', 'r', 'sql', 'databricks', 'pyspark'...","{'cloud': ['databricks'], 'libraries': ['pyspa..."
22433,Data Scientist,Global Data Scientist,Anywhere,via Indeed,Full-time,True,Sudan,2023-01-12 14:21:58,False,False,Sudan,,,,Kimberly-Clark,"['sql', 'r', 'python', 'azure', 'snowflake', '...","{'analyst_tools': ['word', 'sap', 'powerpoint'..."
33262,Data Scientist,Synopsia/ Offre d'emploi Data Scientits,"Paris, France",via Indeed,Full-time,False,France,2023-12-14 13:15:15,False,False,France,,,,Synopsia Ingénierie,"['sql', 'python', 'gcp', 'azure', 'oracle', 'p...","{'analyst_tools': ['power bi'], 'cloud': ['gcp..."


Or you can randomly select a fraction of the rows with frac (e.g., 10%), with or without replacement. With replace=False, which is the default, no row is picked more than once.

In [12]:
df.sample(frac=0.1, replace=False)

,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
589880,Senior Data Scientist,Senior Data Scientist,"Bengaluru, Karnataka, India",via Trabajo.org,Full-time,False,India,2023-08-02 09:13:40,False,False,India,,,,LegalZoom,['sql'],{'programming': ['sql']}
322589,Data Engineer,Data Engineer - WeKnowtice,"Almere, Netherlands",via Indeed,Full-time,False,Netherlands,2023-02-20 17:22:48,True,False,Netherlands,,,,Move Ahead,"['python', 'sql', 'snowflake', 'azure', 'jenki...","{'cloud': ['snowflake', 'azure'], 'other': ['j..."
751897,Data Analyst,Audit Data & Analytics Specialist,"Amsterdam, Netherlands",via Indeed,Full-time,False,Netherlands,2023-09-19 13:02:43,False,False,Netherlands,,,,KPMG,['alteryx'],{'analyst_tools': ['alteryx']}
514751,Data Analyst,Fraud Analytics,"Kuala Lumpur, Federal Territory of Kuala Lumpu...",via BeBee Malaysia,Full-time,False,Malaysia,2023-10-05 11:31:34,True,False,Malaysia,,,,Shopee,['sql'],{'programming': ['sql']}
163927,Data Scientist,Entry/Junior Level Data Scientist/Python Progr...,"Houston, TX",via Trabajo.org,Full-time,False,Sudan,2023-07-19 15:35:59,False,False,Sudan,,,,SynergisticIT,"['go', 'java', 'javascript', 'c++', 'sas', 'sa...","{'analyst_tools': ['sas', 'tableau'], 'librari..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
614466,Senior Data Analyst,Lead Risk Analytics Consultant,"Concord, CA",via The Muse,Full-time,False,"California, United States",2023-08-02 12:01:31,True,True,United States,,,,Wells Fargo,"['sas', 'sas', 'sql', 'python', 'tableau']","{'analyst_tools': ['sas', 'tableau'], 'program..."
61896,Business Analyst,"Junior Business Analyst, Reporting & Analytics","Taguig, Metro Manila, Philippines",via LinkedIn,,False,Philippines,2023-04-14 06:55:06,False,False,Philippines,,,,Northern Trust Corporation,['sql'],{'programming': ['sql']}
601253,Data Scientist,Director of Security Analytics Products / Lead...,"Warsaw, Poland",via LinkedIn,Full-time,False,Poland,2023-08-26 09:44:07,False,False,Poland,,,,Visa,"['python', 'java', 'golang', 'c++', 'sql']","{'programming': ['python', 'java', 'golang', '..."
358091,Data Analyst,Sales Data Analyst,Colombia,via Recruit.net,Full-time,False,Colombia,2023-02-15 00:15:04,True,False,Colombia,,,,Frubana,"['sql', 'excel']","{'analyst_tools': ['excel'], 'programming': ['..."
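
Because sample() is random, the rows above will differ on every run. The sketch below shows the n, frac, and replace options side by side on a hypothetical 100-row DataFrame (toy and the job_id column are made up for illustration). It also passes random_state, an optional sample() parameter not used in the cells above, to make a draw repeatable.

```python
import pandas as pd

# Hypothetical toy data: 100 rows with a made-up job_id column
toy = pd.DataFrame({'job_id': range(100)})

# Fixed number of rows
fixed = toy.sample(n=5, random_state=42)

# Fraction of rows: 10% of 100 rows gives 10 rows
tenth = toy.sample(frac=0.1, random_state=42)

# With replacement, a row can be drawn more than once,
# so the sample may even be larger than the data itself
with_replacement = toy.sample(n=150, replace=True, random_state=42)

# The same random_state reproduces the same sample
repeat = toy.sample(n=5, random_state=42)

print(len(fixed), len(tenth), len(with_replacement))
print(fixed.equals(repeat))
```

Fixing random_state is a design choice: it is handy when results need to be reproducible, and it can be left out when a fresh sample on each run is wanted.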


Now you can analyze these subsets of data.