# Data Summary
---
#### According to the Google Sheets file the Columns consist of: 

**1:** Age Range \
**2:** Industry \
**3:** Job Title\
**4:** Additional Job Context\
**5:** Annual Salary \
**6:** Additional Income (bonuses, overtime...)\
**7:** Currency (USD, GBP, EUR...)\
**8:** Other Column for additional currency inputs (one input INR)\
**9:** Context for Additional Income (bonus percentages)\
**10:** What Country You work in\
**11:** If U.S. what State they work in\
**12:** If U.S. what City they work in\
**13:** How Many Years of Professional Experience overall\
**14:** How Many Years of Professional Experience in this specific field\
**15:** Highest level of education completed\
**16:** Gender\
**17:** Race\
\
I will download the sheets file as a csv and import it as a dataframe (I renamed the columns for ease of use as they were labeled as the question asked in the original survey):

In [34]:
import pandas as pd

# read csv file
df = pd.read_csv('salarySurveyRenamed.csv')

# check the data
print(df.head())

# check data types
df.dtypes

            Timestamp ageRange                       industry  \
0  4/27/2021 11:02:10    25-34   Education (Higher Education)   
1  4/27/2021 11:02:22    25-34              Computing or Tech   
2  4/27/2021 11:02:38    25-34  Accounting, Banking & Finance   
3  4/27/2021 11:02:41    25-34                     Nonprofits   
4  4/27/2021 11:02:42    25-34  Accounting, Banking & Finance   

                                   jobTitle jobContext annualSalary  \
0        Research and Instruction Librarian        NaN       55,000   
1  Change & Internal Communications Manager        NaN       54,600   
2                      Marketing Specialist        NaN       34,000   
3                           Program Manager        NaN       62,000   
4                        Accounting Manager        NaN       60,000   

   addIncome currency otherCurrency addIncomeContext     workCountry  \
0        0.0      USD           NaN              NaN   United States   
1     4000.0      GBP           NaN   

Timestamp            object
ageRange             object
industry             object
jobTitle             object
jobContext           object
annualSalary         object
addIncome           float64
currency             object
otherCurrency        object
addIncomeContext     object
workCountry          object
usState              object
usCity               object
overallProExp        object
fieldExp             object
eduLevel             object
gender               object
race                 object
dtype: object

The dataframe consists of an additional timestamp column that will need to be removed during cleaning.

---

# Data Cleaning

### 1: Combining Salary and Additional Salary
- convert values to numeric int/float (addIncome is already a float, so just change salary)
- create a total compensation column (salary + additional)
- convert to one standard currency, USD

In [None]:
# check for nulls and fill if needed
print("NaNs in salary: ", df["annualSalary"].isna().sum())
print("NaNs in Additional Income: ",df["addIncome"].isna().sum())

# fill the NaNs in additional income with 0 so I can correctly calculate total compensation
df["addIncome"] = df["addIncome"].fillna(0)
print(df["addIncome"].head(5))

# check types
df.dtypes[["annualSalary", "addIncome"]]

# convert annualSalary to type float and remove commas
df["annualSalary"] = pd.to_numeric(df["annualSalary"].astype(str).str.replace(",", ""), errors="coerce")
print(df["annualSalary"].head(5))

# now calculate for total income
df["totalCompensation"] = df["annualSalary"] + df["addIncome"]
print(df["totalCompensation"].head(5))


# we have to standardize all values to USD
# combine currency and otherCurrency
df["currency"] = df["currency"].fillna(df["otherCurrency"])
print("NaNs in currency: ", df["currency"].isna().sum())

# drop otherCurrency
df = df.drop(columns=["otherCurrency"])

# find unique currencies to create dictionary for conversion
df["currency"] = df["currency"].str.strip().str.upper()
unique_currencies = df["currency"].dropna().unique()
print(unique_currencies)


NaNs in salary:  0
NaNs in Additional Income:  0
0       0.0
1    4000.0
2       0.0
3    3000.0
4    7000.0
Name: addIncome, dtype: float64
0    55000
1    54600
2    34000
3    62000
4    60000
Name: annualSalary, dtype: int64
0    55000.0
1    58600.0
2    34000.0
3    65000.0
4    67000.0
Name: totalCompensation, dtype: float64
NaNs in currency:  0
['USD' 'GBP' 'CAD' 'EUR' 'AUD/NZD' 'OTHER' 'CHF' 'ZAR' 'SEK' 'HKD' 'JPY']


In [None]:
# dictionary for conversions, based on google 4/22/25:
conversion_rates = {
    "USD": 1.0,
    "GBP": 1.34,
    "CAD": 0.72,
    "EUR": 1.15,
    "AUD/NZD": 0.70,
    "CHF": 1.23,
    "ZAR": 0.054,
    "SEK": 0.11,
    "HKD": 0.13,
    "JPY": 0.0071,
}

# drop rows that answered other, as too many values and non standard answers
df = df[df["currency"].str.strip().str.upper() != "OTHER"]

# find unique currencies to create dictionary for conversion
df["currency"] = df["currency"].str.strip().str.upper()
unique_currencies = df["currency"].dropna().unique()
print(unique_currencies)


['USD' 'GBP' 'CAD' 'EUR' 'AUD/NZD' 'CHF' 'ZAR' 'SEK' 'HKD' 'JPY']


# 2. Drop Unused Columns

- dropping columns such as timestamp
-

In [42]:
print(df.columns.tolist())

# drop timestamp
df = df.drop(columns=["Timestamp"])

df.head

['Timestamp', 'ageRange', 'industry', 'jobTitle', 'jobContext', 'annualSalary', 'addIncome', 'currency', 'otherCurrency', 'addIncomeContext', 'workCountry', 'usState', 'usCity', 'overallProExp', 'fieldExp', 'eduLevel', 'gender', 'race', 'totalCompensation']


<bound method NDFrame.head of       ageRange                       industry  \
0        25-34   Education (Higher Education)   
1        25-34              Computing or Tech   
2        25-34  Accounting, Banking & Finance   
3        25-34                     Nonprofits   
4        25-34  Accounting, Banking & Finance   
...        ...                            ...   
28130    25-34   Engineering or Manufacturing   
28131    25-34              Computing or Tech   
28132    18-24                            NaN   
28133    25-34              Computing or Tech   
28134    25-34                            ABA   

                                       jobTitle jobContext  annualSalary  \
0            Research and Instruction Librarian        NaN         55000   
1      Change & Internal Communications Manager        NaN         54600   
2                          Marketing Specialist        NaN         34000   
3                               Program Manager        NaN         62000   
4