# **Cleaning and manipulating data**


## Introduction

Your job as an analyst is to extract meaningful information about the workforce from available data. But before that happens, you have to check that the data you have is accurate and complete, and that it satisfies the requirements for your analysis. 

A common starting point for analytics projects is preparing roster information from HR Information Systems for analysis. These data include one row of data per employee, and multiple columns that contain specific information about the workforce (such as name, email address, department, tenure, birthday, and other demographic information). This information can also be limited to a specific population in a particular status, such as active employees (headcount), terminated employees, transfers in, transfers out, and more. 

In this lesson we'll work with a synthetic headcount (roster) dataset that features the structure and contents seen in many HRIS data extracts. 

We'll highlight common things you might have to do when preparing data for analysis, such as **creating columns, dropping records, replacing values, merging (vlookup) data from other files, and more.** In the next section, we will perform some basic summarization of the data we clean and produce in this section.

Finally, this is a simplified tutorial. It's worth taking a look at the __[pandas Data Wrangling cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)__ and the __[pandas library's 10-minute introduction to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)__ for a more technical introduction to common tasks you'll need to perform when cleaning data. For now, this is a general overview that will hopefully help you get started working with HR data.


##### Note:

`#` in the code blocks below denotes a comment that I have added for you to better understand what the code is saying/doing.

<a id="section_a"></a>
## **Preparation and loading in sample data for this exercise**

First, like the beginning of any Python script, we need to import certain code libraries that will help us do our work. 

We will be using only the pandas library in this exercise, which we will refer to as `pd` for brevity. 


In [None]:
import pandas as pd 


Next, let's load in our raw example data:

In [None]:
#let's make a variable with the location of our file
#note that this is a link to a website. you can also specify a local file path, e.g. "C:/documents/headcount_report.csv"
hc_filename='https://raw.githubusercontent.com/bvoorhees/code_and_data_for_the_people_analytics_enthusiast/master/hr_information_systems/data/widgetcorp_workforce_detail_roster_2018-12-31.csv'

hcdata = pd.read_csv(hc_filename) #read in CSV data and assign it to an object 
#we're calling "hcdata"
#remember that we can also read in excel files with pd.read_excel() 
#and even specify the sheet name, e.g. pd.read_excel(file_path, sheet_name='name_of_the_sheet')
#note that we are reading the data from a GitHub link; this is the same "headcount.csv" file in the './data/' folder!


hcdata.head() #let's take a look at the head of the data (first 5 rows) to see what's inside

Unnamed: 0,Employee Number,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Base Comp,Gender,Ethnicity,Tenure Date,Report Effective Date
0,9,Ayush,Sevin,ayush.sevin@widgetcorp.com,Admin Offices,Analyst,1.513698,38.08,79196.822851,Male,2,2017-06-26,2018-12-31
1,47,Jose,Montoya,jose.montoya@widgetcorp.com,Admin Offices,Analyst,2.496051,39.23,81589.913983,Male,4,2016-07-02,2018-12-31
2,70,Sahheeda,el-Kazmi,sahheeda.el-kazmi@widgetcorp.com,Admin Offices,Analyst,2.818739,26.77,55688.5759,Female,6,2016-03-06,2018-12-31
3,292,Aaqil,Bohm,aaqil.bohm@widgetcorp.com,Admin Offices,Analyst,4.216594,32.8,68214.063541,Male,2,2014-10-12,2018-12-31
4,324,Naseema,al-Jalil,naseema.al-jalil@widgetcorp.com,Admin Offices,Analyst,4.402523,28.04,58322.110392,Female,6,2014-08-05,2018-12-31
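
Before changing anything, it can help to run a few quick completeness checks on a freshly loaded extract. The following is a minimal sketch rather than part of the original lesson: it builds a tiny stand-in table (`sample`, a name made up for this example) from three rows of the preview above so it runs without a download, and the same three checks could be pointed at `hcdata`.

```python
import io
import pandas as pd

# Tiny stand-in for the roster file; the rows are copied from the preview above.
sample_csv = io.StringIO(
    "Employee Number,First Name,Department,Pay Rate\n"
    "9,Ayush,Admin Offices,38.08\n"
    "47,Jose,Admin Offices,39.23\n"
    "70,Sahheeda,Admin Offices,26.77\n"
)
sample = pd.read_csv(sample_csv)

n_rows, n_cols = sample.shape                                 # size of the extract
duplicate_ids = sample["Employee Number"].duplicated().sum()  # repeated employee numbers
missing_per_column = sample.isna().sum()                      # missing values in each column

print(n_rows, n_cols)
print(duplicate_ids)
print(missing_per_column)
```

A roster is expected to have one row per employee, so a non-zero duplicate count is usually worth raising with the HRIS team before going further.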



## **Creating New Columns** 

This file is missing some things we need for an upcoming analysis on compensation.

So, let's create a couple of columns. We need to know that these employees are active employees, and we also need to calculate their annual base salaries. 

In [None]:
#Make a column noting the employee's status
hcdata['Status']='Active' #This is a column with text information, so we need to put everything after the "=" in quotes

#make a column for annual salary, where we take the hourly rate, make it a weekly rate (x 40 hours), 
#and then x 52 for weeks in the year to get the annual rate
hcdata['Annual Salary']=hcdata['Pay Rate']*40*52

#let's take a look at the data and see if our changes showed up
hcdata.head()

Unnamed: 0,Employee Number,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Base Comp,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary
0,9,Ayush,Sevin,ayush.sevin@widgetcorp.com,Admin Offices,Analyst,1.513698,38.08,79196.822851,Male,2,2017-06-26,2018-12-31,Active,79206.4
1,47,Jose,Montoya,jose.montoya@widgetcorp.com,Admin Offices,Analyst,2.496051,39.23,81589.913983,Male,4,2016-07-02,2018-12-31,Active,81598.4
2,70,Sahheeda,el-Kazmi,sahheeda.el-kazmi@widgetcorp.com,Admin Offices,Analyst,2.818739,26.77,55688.5759,Female,6,2016-03-06,2018-12-31,Active,55681.6
3,292,Aaqil,Bohm,aaqil.bohm@widgetcorp.com,Admin Offices,Analyst,4.216594,32.8,68214.063541,Male,2,2014-10-12,2018-12-31,Active,68224.0
4,324,Naseema,al-Jalil,naseema.al-jalil@widgetcorp.com,Admin Offices,Analyst,4.402523,28.04,58322.110392,Female,6,2014-08-05,2018-12-31,Active,58323.2
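
A derived column is easier to trust when it can be compared with a figure the HRIS already reports. Below is a small sketch, assuming `Base Comp` is the system's own annual figure; the frame `check` and the tolerance of 50 are made up for illustration, and the values are copied from the first three rows above.

```python
import pandas as pd

# Values copied from the first three rows of the preview above.
check = pd.DataFrame({
    "Pay Rate": [38.08, 39.23, 26.77],
    "Base Comp": [79196.822851, 81589.913983, 55688.5759],
})
check["Annual Salary"] = check["Pay Rate"] * 40 * 52

# Absolute gap between our derived salary and the HRIS figure.
check["Gap"] = (check["Annual Salary"] - check["Base Comp"]).abs()

# The tolerance of 50 is an arbitrary assumption for this sketch.
needs_review = check.loc[check["Gap"] > 50]

print(check)
print(len(needs_review))
```

Small gaps like these are what you would expect when the hourly rate has been rounded to cents; large gaps would be a reason to ask how the HRIS figure is calculated.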


## **Renaming Columns** 

A couple of these columns could use better names to help us with our work later. Let's rename them so they make better sense to us.

In [None]:
#Employee Number refers to the unique Employee ID number. Let's rename it.
hcdata=hcdata.rename(columns={'Employee Number':'Employee ID'}) 

#let's take a look at the changes
hcdata.head()

Unnamed: 0,Employee ID,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Base Comp,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary
0,9,Ayush,Sevin,ayush.sevin@widgetcorp.com,Admin Offices,Analyst,1.513698,38.08,79196.822851,Male,2,2017-06-26,2018-12-31,Active,79206.4
1,47,Jose,Montoya,jose.montoya@widgetcorp.com,Admin Offices,Analyst,2.496051,39.23,81589.913983,Male,4,2016-07-02,2018-12-31,Active,81598.4
2,70,Sahheeda,el-Kazmi,sahheeda.el-kazmi@widgetcorp.com,Admin Offices,Analyst,2.818739,26.77,55688.5759,Female,6,2016-03-06,2018-12-31,Active,55681.6
3,292,Aaqil,Bohm,aaqil.bohm@widgetcorp.com,Admin Offices,Analyst,4.216594,32.8,68214.063541,Male,2,2014-10-12,2018-12-31,Active,68224.0
4,324,Naseema,al-Jalil,naseema.al-jalil@widgetcorp.com,Admin Offices,Analyst,4.402523,28.04,58322.110392,Female,6,2014-08-05,2018-12-31,Active,58323.2
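
The same `rename` call accepts several old-to-new pairs at once. One thing to be aware of: by default, `rename` quietly skips any name it cannot find, so a typo in the old name changes nothing. The sketch below uses a small made-up frame (`roster`), and `Hire Date` is a hypothetical new name chosen only for illustration; passing `errors="raise"` asks pandas to stop with an error instead of skipping.

```python
import pandas as pd

roster = pd.DataFrame({
    "Employee Number": [9, 47],
    "Tenure Date": ["2017-06-26", "2016-07-02"],
})

# Rename two columns in one call. "Hire Date" is a hypothetical name for this sketch.
roster = roster.rename(
    columns={"Employee Number": "Employee ID", "Tenure Date": "Hire Date"},
    errors="raise",  # fail loudly if a column named here does not exist
)

print(list(roster.columns))
```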


## **Removing Columns**

We no longer need the Base Comp column, since we calculated Annual Salary from the hourly Pay Rate and will use that for our analysis. Let's get rid of it.

In [None]:
hcdata=hcdata.drop(columns=['Base Comp'])

#let's take a look at the changes
hcdata.head()

Unnamed: 0,Employee ID,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary
0,9,Ayush,Sevin,ayush.sevin@widgetcorp.com,Admin Offices,Analyst,1.513698,38.08,Male,2,2017-06-26,2018-12-31,Active,79206.4
1,47,Jose,Montoya,jose.montoya@widgetcorp.com,Admin Offices,Analyst,2.496051,39.23,Male,4,2016-07-02,2018-12-31,Active,81598.4
2,70,Sahheeda,el-Kazmi,sahheeda.el-kazmi@widgetcorp.com,Admin Offices,Analyst,2.818739,26.77,Female,6,2016-03-06,2018-12-31,Active,55681.6
3,292,Aaqil,Bohm,aaqil.bohm@widgetcorp.com,Admin Offices,Analyst,4.216594,32.8,Male,2,2014-10-12,2018-12-31,Active,68224.0
4,324,Naseema,al-Jalil,naseema.al-jalil@widgetcorp.com,Admin Offices,Analyst,4.402523,28.04,Female,6,2014-08-05,2018-12-31,Active,58323.2


## **Filtering Values**

I want to filter out the administrative offices, since higher-ups told us they should not be a part of our analysis. 

In pandas, there are many ways to do this. We'll focus on a way that *in my opinion* is fairly readable by other people who may stumble across your code someday. 

In [None]:
#first, we'll make a "filter" variable; note that "!=" means "does not equal"
admin_filter = hcdata['Department']!='Admin Offices' 
#this makes a bunch of True/False values that we can then feed back into our dataframe
#note we could have said "==" instead of "!=" to mean "equals"

#now, just to check, let's look at our "filter" 
admin_filter.head()

# Note the first five values are "falses" because above we said we don't want "Admin offices" in our values, 
# and the first few records in our dataset had a bunch of "admin offices"

0    False
1    False
2    False
3    False
4    False
Name: Department, dtype: bool

In [None]:
#now let's "apply" the filter into our dataset
#we use the .loc method to give the 'location' of the values that don't equal Admin Offices
#remember that .loc takes the location of rows first, and columns second (after the comma)
#the ':' denotes ALL columns in the data
hcdata = hcdata.loc[admin_filter,:] #note that we are "overwriting" the headcount data.

#and, just to check, let's look at the result

hcdata.head() #behold the first few records we previously saw are now out of the dataset

Unnamed: 0,Employee ID,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary
24,17,William,Torain,william.torain@widgetcorp.com,Complaince,Analyst,1.949403,28.07,Male,3,2017-01-17,2018-12-31,Active,58385.6
25,33,Rachelle,Thomas,rachelle.thomas@widgetcorp.com,Complaince,Analyst,2.355758,27.17,Female,1,2016-08-22,2018-12-31,Active,56513.6
26,63,Drue,Hightower,drue.hightower@widgetcorp.com,Complaince,Associate,2.728467,37.87,Male,3,2016-04-08,2018-12-31,Active,78769.6
27,178,Husniyya,al-Nazir,husniyya.al-nazir@widgetcorp.com,Complaince,Analyst,3.682307,26.93,Female,6,2015-04-26,2018-12-31,Active,56014.4
28,196,Kelsey,Nguyen,kelsey.nguyen@widgetcorp.com,Complaince,Analyst,3.765248,27.75,Female,2,2015-03-26,2018-12-31,Active,57720.0
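
Filters built this way can also be combined. The sketch below is one possible extension, using a small made-up frame (`roster`) and a hypothetical exclusion list: `isin` checks against several values at once, `~` flips True and False, and `&` means "and".

```python
import pandas as pd

roster = pd.DataFrame({
    "Department": ["Admin Offices", "Legal", "Sales", "Executive"],
    "Job Level": ["Analyst", "Analyst", "Associate", "VP"],
})

# Exclude several departments at once: isin() builds the True/False filter,
# and "~" flips it (True becomes False and vice versa).
excluded = ["Admin Offices", "Executive"]  # hypothetical exclusion list
dept_filter = ~roster["Department"].isin(excluded)

# Combine two filters with "&" (and); wrap each comparison in parentheses.
analyst_filter = dept_filter & (roster["Job Level"] == "Analyst")

kept = roster.loc[analyst_filter, :]
print(kept)
```

Keeping each filter in its own named variable, as above, makes it easy to check them one at a time with `.head()` or `.sum()` before applying them.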


## **Renaming/replacing values**

Sometimes our HRIS data isn't as clean as it needs to be. Department name changes, for one, are common after reorgs. They also can be abbreviated (or take different abbreviations over time).

In this example, we'll take a look at our department names to see if some need to be rewritten, and then we'll rewrite those to a cleaner, unified name. 

In [None]:
#An easy way to see all departments and the number of rows associated 
#with them is the .value_counts() method
hcdata['Department'].value_counts()

Engineering                                        92
Sales                                              84
Operations                                         68
Marketing                                          66
Procurement                                        54
Design                                             35
Information Technology and Information Seucrity    27
HR                                                 24
Finance                                            19
IT/IS                                              15
Legal                                              12
Saless                                             12
Complaince                                         10
Executive                                           8
Name: Department, dtype: int64

In [None]:
#Note that Saless is misspelled, and that IT/IS is an abbreviation for Information Technology and Information Security (which is also misspelled in the data).
#Let's replace "Saless" with the correct spelling, Sales, and the long IT name with the easy-to-use (and easy-to-graph) abbreviation IT/IS.

#Starting with Sales:
#We need to call out (or filter like the above) the subset of records we want to replace
saless_filter=hcdata['Department']=='Saless'
#Now let's apply that filter to our dataset so we can overwrite those bad records
hcdata.loc[saless_filter,'Department']='Sales' #Again we use the .loc method, but also put 'Department' as the second "location"

#Next IT/IS:
#Like what we did with "Saless" above, we first make a filter
IT_filter=hcdata['Department']=='Information Technology and Information Seucrity'
#Again let's apply that filter to our dataset so we can overwrite those bad records
hcdata.loc[IT_filter,'Department']='IT/IS'

#Let's double check our work to make sure the departments are appropriately grouped
hcdata['Department'].value_counts()

Sales          96
Engineering    92
Operations     68
Marketing      66
Procurement    54
IT/IS          42
Design         35
HR             24
Finance        19
Legal          12
Complaince     10
Executive       8
Name: Department, dtype: int64
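
When there are many values to fix, writing one filter per value gets long. An alternative is to list the corrections in a dictionary and pass it to the `.replace()` method. This is a sketch on a small made-up Series (`departments`), not the approach used above; the dictionary name `fixes` is also just for illustration.

```python
import pandas as pd

departments = pd.Series(
    ["Sales", "Saless", "IT/IS", "Information Technology and Information Seucrity"],
    name="Department",
)

# Map each bad value to the value that should replace it.
fixes = {
    "Saless": "Sales",
    "Information Technology and Information Seucrity": "IT/IS",
}

# replace() swaps whole values that match a dictionary key and leaves the rest alone.
cleaned = departments.replace(fixes)
print(cleaned.value_counts())
```

The dictionary doubles as a record of every correction made, which is handy to share with the HRIS team so the source data can be fixed too.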

## **Finding missing values**

Sometimes our HRIS data isn't complete for every employee. Missing values frequently need to be identified and dealt with, usually by consulting with the HRIS team to get the missing data filled in, or, in some cases, by imputing (filling in) the missing values ourselves. 

Let's start by identifying missing values. We want to be sure that Annual Salary is populated for all employees. 

In [None]:
#Just like what we did above to find records that met a certain criteria, we'll build a filter and use the .loc method,
#but this time the filter uses .isna() to flag rows where "Annual Salary" is missing

missing_salary_filter=hcdata['Annual Salary'].isna()

hcdata.loc[missing_salary_filter] 
#Let's talk to our HRIS and comp teams before we make any assumptions and "fill in" the data ourselves
#one way we could fill in the data, however, would be to use an average or median value of some kind; 
#for more on this visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
#or https://stackoverflow.com/questions/18689823/pandas-dataframe-replace-nan-values-with-average-of-columns

Unnamed: 0,Employee ID,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary
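
Our roster has no missing salaries, so the result above is empty. For cases where values are missing and the HRIS and comp teams agree that filling them in is acceptable, here is a minimal sketch of the median approach mentioned in the comments. Everything in it is made up for illustration: the frame `pay`, its values, and the choice of the department median as the fill value.

```python
import pandas as pd

pay = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Legal", "Legal"],
    "Annual Salary": [50000.0, None, 70000.0, 90000.0, None],
})

# Median salary of each employee's department, lined up row by row with the data.
dept_median = pay.groupby("Department")["Annual Salary"].transform("median")

# Keep a flag so filled-in values stay distinguishable from reported ones.
pay["Salary Imputed"] = pay["Annual Salary"].isna()
pay["Annual Salary"] = pay["Annual Salary"].fillna(dept_median)

print(pay)
```

The flag column is a design choice worth keeping: it lets you rerun any later analysis with and without the imputed rows to see whether they change the result.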


## **Sorting**

Sometimes we want to share this granular data with other teams, or include as an appendix in our analysis for stakeholders to sift through themselves. A good idea to make it "easier" on the eyes of the reader is to sort the data in some order. 

Let's sort in alphabetical order of Department, and then by annual salary from highest to lowest within each department. 

In [None]:
#We do this by passing a list of column names to the sort_values method. 
#Remember that sort_values returns a new sorted copy, so we assign the result back to hcdata for the changes to "save"

hcdata=hcdata.sort_values(by=['Department','Annual Salary'],ascending=[True,False])

#now let's take a look at the head of our data
hcdata.head()
#notice how the row numbers are now out of order, since each row keeps its original label. 

Unnamed: 0,Employee ID,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary
30,427,Erik,Sanders,erik.sanders@widgetcorp.com,Complaince,Associate,4.89498,49.94,Male,4,2014-02-07,2018-12-31,Active,103875.2
33,848,Christina,Blaze,christina.blaze@widgetcorp.com,Complaince,VP,7.400493,48.62,Female,1,2011-08-06,2018-12-31,Active,101129.6
26,63,Drue,Hightower,drue.hightower@widgetcorp.com,Complaince,Associate,2.728467,37.87,Male,3,2016-04-08,2018-12-31,Active,78769.6
29,305,Edgar,Lopez,edgar.lopez@widgetcorp.com,Complaince,Analyst,4.324053,37.33,Male,4,2014-09-03,2018-12-31,Active,77646.4
31,513,Marianne,Rem,marianne.rem@widgetcorp.com,Complaince,Associate,5.307411,34.07,Female,2,2013-09-09,2018-12-31,Active,70865.6
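
After sorting, each row still carries its original row label, which is why the labels above look shuffled. If you would rather have them run 0, 1, 2, ... in the new order (for example, before sharing the file), `reset_index(drop=True)` does that. A small sketch on made-up data (`roster` and `ordered` are names invented for this example):

```python
import pandas as pd

roster = pd.DataFrame({
    "Department": ["Legal", "Design", "Legal"],
    "Annual Salary": [60000.0, 55000.0, 80000.0],
})

ordered = roster.sort_values(by=["Department", "Annual Salary"], ascending=[True, False])
print(ordered.index.tolist())  # original row labels, now out of order

# drop=True discards the old labels instead of keeping them as a new column.
ordered = ordered.reset_index(drop=True)
print(ordered.index.tolist())
```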


## **Combining/Merging data sources together**

The bane of the Excel analyst's existence can sometimes be VLOOKUP. Not only is it a function we have to apply manually every time we want to combine data sources, but it is also prone to errors if not set up properly. Python is prone to errors too, but at least it isn't as manual. It can also combine much more data very quickly, and because you've used code, you have a record of exactly how it was done. 

We've noticed that our headcount data didn't ship with education data. We have another file for this called `highest_education_completed.csv` that we'll first import, then attach to our headcount data (`hcdata`), using employee IDs as the basis for 'attaching' the data.

In [None]:
#first, let's read in our data and assign it to an object called educationdata
educationdata=pd.read_csv('https://raw.githubusercontent.com/bvoorhees/code_and_data_for_the_people_analytics_enthusiast/master/hr_information_systems/data/widgetcorp_worker_education.csv') 

#note that we are reading the data from a GitHub link; this is the same "highest_education_completed.csv" file in the './data/' folder!


#next, let's take a look at the data to make sure it makes sense
educationdata.head() #notice that the name "Employee Number" is used instead of "Employee ID", the name we gave this column above

Unnamed: 0,Employee Number,Education
0,9,High School Diploma
1,47,High School Diploma
2,70,High School Diploma
3,97,High School Diploma
4,160,Some College


In [None]:
#First, let's change the name of the "Employee Number" column to "Employee ID"
educationdata=educationdata.rename(columns={'Employee Number':'Employee ID'})

#Now we are ready to actually combine the two datasets together. This is fairly simple, and we can do it with
#the pandas merge function, telling the merge operation to perform the merge using the "Employee ID" for the reference column

combined_data=pd.merge(hcdata,educationdata,on='Employee ID')

combined_data.head() #taking a look at the data, we see the merge is complete

Unnamed: 0,Employee ID,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary,Education
0,427,Erik,Sanders,erik.sanders@widgetcorp.com,Complaince,Associate,4.89498,49.94,Male,4,2014-02-07,2018-12-31,Active,103875.2,Bachelor
1,848,Christina,Blaze,christina.blaze@widgetcorp.com,Complaince,VP,7.400493,48.62,Female,1,2011-08-06,2018-12-31,Active,101129.6,Some College
2,63,Drue,Hightower,drue.hightower@widgetcorp.com,Complaince,Associate,2.728467,37.87,Male,3,2016-04-08,2018-12-31,Active,78769.6,High School Diploma
3,305,Edgar,Lopez,edgar.lopez@widgetcorp.com,Complaince,Analyst,4.324053,37.33,Male,4,2014-09-03,2018-12-31,Active,77646.4,Some College
4,513,Marianne,Rem,marianne.rem@widgetcorp.com,Complaince,Associate,5.307411,34.07,Female,2,2013-09-09,2018-12-31,Active,70865.6,Master


**Note:** 
This is a very simple and neat example where we have a record for every employee in both `hcdata` and `educationdata`. In some cases, one of the files you're working with will have more records than you need, or fewer. This means you need to tell `pd.merge` explicitly how to handle the merge (by passing *left*, *right*, *inner* or *outer* to the `how=` argument; the default is *inner*, which keeps only the IDs found in both files), and of course examine your merge results carefully; a good overview of `pd.merge` can be found __[here](https://www.kaggle.com/crawford/python-merge-tutorial)__.
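
To make the note above concrete, here is a sketch of a merge where the two files do not line up perfectly. The frames `roster` and `education` are made up for this example. `how="left"` keeps every roster row, `indicator=True` adds a `_merge` column saying where each row was found, and `validate="one_to_one"` makes pandas raise an error if an Employee ID appears more than once in either file.

```python
import pandas as pd

roster = pd.DataFrame({
    "Employee ID": [9, 47, 70],
    "Department": ["Sales", "Legal", "Design"],
})
education = pd.DataFrame({
    "Employee ID": [9, 47, 97],
    "Education": ["Bachelor", "Master", "Some College"],
})

merged = pd.merge(
    roster,
    education,
    on="Employee ID",
    how="left",             # keep every roster row, matched or not
    indicator=True,         # adds a "_merge" column describing each row's source
    validate="one_to_one",  # error out if an ID is duplicated in either file
)

# Roster employees with no education record.
unmatched = merged.loc[merged["_merge"] == "left_only", "Employee ID"].tolist()

print(merged)
print(unmatched)
```

Checking that the merged row count equals the roster row count, and reviewing the unmatched IDs, is a quick way to catch the silent errors a VLOOKUP can hide.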

## **Dates and duration of time**

Dates in pandas, just like in any data processing/analysis software, need special attention. Even when your HRIS has exported a clean date in Excel, you must check that the date was recognized by pandas. If it wasn't recognized as a date, we can convert it to one quite quickly, which makes working with dates much more convenient. 

In [None]:
#we'll use the .dtypes attribute to see the type of data in each column of our combined_data 
combined_data.dtypes #here we can see the "Tenure Date" variable is stored as a generic "object" (text), not as a date

Employee ID                int64
First Name                object
Last Name                 object
Email Address             object
Department                object
Job Level                 object
Tenure                   float64
Pay Rate                 float64
Gender                    object
Ethnicity                  int64
Tenure Date               object
Report Effective Date     object
Status                    object
Annual Salary            float64
Education                 object
dtype: object

In [None]:
#So let's convert the Tenure Date variable to the proper data type (date) using the to_datetime function in pandas
combined_data['Tenure Date']=pd.to_datetime(combined_data['Tenure Date'])

combined_data.dtypes #now we see it's a datetime format

Employee ID                       int64
First Name                       object
Last Name                        object
Email Address                    object
Department                       object
Job Level                        object
Tenure                          float64
Pay Rate                        float64
Gender                           object
Ethnicity                         int64
Tenure Date              datetime64[ns]
Report Effective Date            object
Status                           object
Annual Salary                   float64
Education                        object
dtype: object
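
Real extracts sometimes contain entries that are not dates at all, and by default `pd.to_datetime` stops with an error when it meets one. The sketch below, on made-up values (`raw_dates`), shows one way to handle that: `errors="coerce"` turns anything unparseable into a missing date (`NaT`) that you can then count and follow up on.

```python
import pandas as pd

raw_dates = pd.Series(["2017-06-26", "2016-07-02", "not available"])

# errors="coerce" converts unparseable entries to NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed)
print(parsed.isna().sum())  # how many entries could not be read as dates
```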

Now, let's suppose we want to calculate each employee's tenure relative to today.

In [None]:
#like the above, we'll use the to_datetime function, this time to turn a fixed "as of" date (2020-11-21, standing in for today) into a date
#then we subtract the employee's tenure date from it to get the number of days employed at the company
combined_data['Tenure_Today']=pd.to_datetime('2020-11-21') - combined_data['Tenure Date']

#now let's take a look at the data
combined_data.head() #notice how pandas has converted the tenure into days with a label at the end

Unnamed: 0,Employee ID,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary,Education,Tenure_Today
0,427,Erik,Sanders,erik.sanders@widgetcorp.com,Complaince,Associate,4.89498,49.94,Male,4,2014-02-07,2018-12-31,Active,103875.2,Bachelor,2479 days
1,848,Christina,Blaze,christina.blaze@widgetcorp.com,Complaince,VP,7.400493,48.62,Female,1,2011-08-06,2018-12-31,Active,101129.6,Some College,3395 days
2,63,Drue,Hightower,drue.hightower@widgetcorp.com,Complaince,Associate,2.728467,37.87,Male,3,2016-04-08,2018-12-31,Active,78769.6,High School Diploma,1688 days
3,305,Edgar,Lopez,edgar.lopez@widgetcorp.com,Complaince,Analyst,4.324053,37.33,Male,4,2014-09-03,2018-12-31,Active,77646.4,Some College,2271 days
4,513,Marianne,Rem,marianne.rem@widgetcorp.com,Complaince,Associate,5.307411,34.07,Female,2,2013-09-09,2018-12-31,Active,70865.6,Master,2630 days


In [None]:
#let's convert the data into years. 

#first we need to convert the data type to an integer that we can divide by cleanly.
#we use the .dt.days accessor to do this
combined_data['Tenure_Today']=combined_data['Tenure_Today'].dt.days

#next, we just divide tenure by 365.25 to convert to years (.25 accounts for leap years!) 
#round to the nearest hundredth, and write over our Tenure_Today variable
combined_data['Tenure_Today']=round(combined_data['Tenure_Today']/365.25,2)

combined_data.head() #notice how pandas has converted the tenure into years. Now we are ready to analyze our data!

Unnamed: 0,Employee ID,First Name,Last Name,Email Address,Department,Job Level,Tenure,Pay Rate,Gender,Ethnicity,Tenure Date,Report Effective Date,Status,Annual Salary,Education,Tenure_Today
0,427,Erik,Sanders,erik.sanders@widgetcorp.com,Complaince,Associate,4.89498,49.94,Male,4,2014-02-07,2018-12-31,Active,103875.2,Bachelor,6.79
1,848,Christina,Blaze,christina.blaze@widgetcorp.com,Complaince,VP,7.400493,48.62,Female,1,2011-08-06,2018-12-31,Active,101129.6,Some College,9.3
2,63,Drue,Hightower,drue.hightower@widgetcorp.com,Complaince,Associate,2.728467,37.87,Male,3,2016-04-08,2018-12-31,Active,78769.6,High School Diploma,4.62
3,305,Edgar,Lopez,edgar.lopez@widgetcorp.com,Complaince,Analyst,4.324053,37.33,Male,4,2014-09-03,2018-12-31,Active,77646.4,Some College,6.22
4,513,Marianne,Rem,marianne.rem@widgetcorp.com,Complaince,Associate,5.307411,34.07,Female,2,2013-09-09,2018-12-31,Active,70865.6,Master,7.2
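
The same steps work with any reference date, not only a hard-coded one. As a sketch, the example below measures tenure as of each row's `Report Effective Date` and does the conversion to years in a single expression. The frame `dates` and the column name `Tenure_At_Report` are made up for illustration; the dates are copied from the first two rows of the original roster.

```python
import pandas as pd

dates = pd.DataFrame({
    "Tenure Date": ["2017-06-26", "2016-07-02"],
    "Report Effective Date": ["2018-12-31", "2018-12-31"],
})

# Both columns must be real dates before they can be subtracted.
for col in ["Tenure Date", "Report Effective Date"]:
    dates[col] = pd.to_datetime(dates[col])

# Days between the two dates, then years rounded to the nearest hundredth.
days = (dates["Report Effective Date"] - dates["Tenure Date"]).dt.days
dates["Tenure_At_Report"] = (days / 365.25).round(2)

print(dates)
```

Tying tenure to the report date keeps the result reproducible: rerunning the notebook next month gives the same numbers for the same extract.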


There is a lot more than we can cover here. A great video intro to working with dates and times in pandas can be found __[here](https://www.youtube.com/watch?v=yCgJGsg0Xa4)__. Be sure to look at the links in the description of that video!


## **Write the cleaned data for this exercise**

We can use the `.to_csv` method to export data to a CSV file. But first, we will need to 'mount' our Google Drive in Colab so that Python can write data to it.

In [None]:
#first, because we are using google colab, we want to make sure it can see our google drive
#to do this, we will mount our drive using the code below, and write the data to a folder called 
#'Python for HCM Datasets from Lessons' WHICH YOU WILL NEED TO MAKE IN YOUR GOOGLE DRIVE HOME FOLDER

from google.colab import drive
drive.mount('drive')

Mounted at drive


In [None]:
#now, write the data
combined_data.to_csv('drive/My Drive/Python for HCM Datasets from Lessons/combined_data_for_analysis.csv',index=False) #specifying index=False keeps .to_csv from writing out row numbers in the first column

#note that we could change the file path to a local drive e.g. "C:/desktop/my_clean_report.csv"
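
One thing to keep in mind when you load the file again later: a CSV stores dates as plain text, so the date type we set up earlier is not preserved. The sketch below writes a small made-up frame (`cleaned`) to a temporary folder rather than Google Drive, so it runs anywhere, then reads it back with `parse_dates` so the date column comes back as a date.

```python
import os
import tempfile
import pandas as pd

cleaned = pd.DataFrame({
    "Employee ID": [427, 848],
    "Tenure Date": pd.to_datetime(["2014-02-07", "2011-08-06"]),
})

# Write to a temporary folder; swap in your own path when doing this for real.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "combined_data_for_analysis.csv")
cleaned.to_csv(out_path, index=False)

# parse_dates tells read_csv which columns to convert back into dates.
reloaded = pd.read_csv(out_path, parse_dates=["Tenure Date"])
print(reloaded.dtypes)
```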