## Import the Pandas library and create a DataFrame

Before doing anything else, you'll need to import Pandas and get some data to work with.

In [98]:
#!pip install pandas
import pandas as pd

In [99]:
data = {
    'Name': ['Alice', 'Bob', 'Claire', 'David', 'Emma','Emma'],
    'Age': [25, 30, 22, None, 35,35],
    'Salary': ['50000', '60000', '45000€', '70000', '80000', '80000'],
    'Department': ['HR', None, 'Finance', 'IT', 'Marketing', 'Marketing'],
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Female'],
    'Hire_Date': ['2020-03-15', '2019-05-20', '2021-01-10', '2018-11-05', '2022-02-28', '2022-02-28'],
    'Performance_Rating': ['4.5', '3.8', None, '4.0', '4.7', '4.7'],
    'ID':['1.5','1.6','1.7','1.8','1.9','1.9']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
5,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9


## 1. Dealing with missing values

Missing values, also known as null values, can be a headache when handling data. They might cause errors, or throw off counts.

### .dropna()
One strategy to deal with missing values is to simply eliminate them. `.dropna()` will delete either rows or columns containing missing values.

> **Axis:** Several Pandas methods take a parameter called `axis`. This parameter determines the direction in which the method is applied.  
> `axis=0` refers to rows (the index)  
> `axis=1` refers to columns

In [100]:
df.dropna(axis=0)   # to drop rows

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
5,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9


Each row containing at least one missing value (None or NaN) has been removed from the output.

In [101]:
df.dropna(axis=1)   # to drop columns

Unnamed: 0,Name,Salary,Gender,Hire_Date,ID
0,Alice,50000,Female,2020-03-15,1.5
1,Bob,60000,Male,2019-05-20,1.6
2,Claire,45000€,Female,2021-01-10,1.7
3,David,70000,Male,2018-11-05,1.8
4,Emma,80000,Female,2022-02-28,1.9
5,Emma,80000,Female,2022-02-28,1.9


This time, all columns containing at least one missing value have been dropped.
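
Dropping every row or column that has any gap can discard a lot of data. The sketch below shows three narrower options that `.dropna()` offers: `subset=`, `how=` and `thresh=`. The `demo` DataFrame is a made-up miniature of the data above, used only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the tutorial's data, for illustration only
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'David'],
    'Age': [25.0, 30.0, None],
    'Department': ['HR', None, 'IT'],
})

# Drop a row only when 'Age' is missing; gaps in other columns are tolerated
age_known = demo.dropna(subset=['Age'])

# Drop a row only when every value in it is missing (no such row here)
not_empty = demo.dropna(how='all')

# Keep only rows that have at least 3 non-null values
mostly_full = demo.dropna(thresh=3)

print(age_known)
print(not_empty)
print(mostly_full)
```

As with the calls above, each of these returns a new DataFrame and leaves `demo` unchanged.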

### .fillna()
Another method is to fill in missing values. These can be inferred from other values in the row or column, or set to a constant.   
Let's take the mean of everyone's age and use that value to fill in any missing data in the column.

In [102]:
age_mean = df['Age'].mean()
age_mean

29.4

In [103]:
df['Age'] # here we see one missing age

0    25.0
1    30.0
2    22.0
3     NaN
4    35.0
5    35.0
Name: Age, dtype: float64

In [104]:
df['Age'].fillna(age_mean)

0    25.0
1    30.0
2    22.0
3    29.4
4    35.0
5    35.0
Name: Age, dtype: float64
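
When several columns need different fill values, `.fillna()` also accepts a dictionary that maps column names to values. This is a minimal sketch on a made-up `demo` DataFrame; the fill value `'Unknown'` is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical data with one gap in each column
demo = pd.DataFrame({
    'Age': [25.0, None, 35.0],
    'Department': ['HR', None, 'IT'],
})

fill_values = {
    'Age': demo['Age'].mean(),    # mean of 25 and 35; nulls are skipped
    'Department': 'Unknown',      # a constant placeholder
}

# Each column is filled with its own value; demo itself is not modified
filled = demo.fillna(fill_values)
print(filled)
```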

## 2. Handling duplicates

Duplicates can throw off an analysis. They might make it seem as if more sales were made or more customers exist than is actually true.

### .drop_duplicates()
Look closely at the last two lines of `df`. They contain all of the same information! `.drop_duplicates()` exists to spot such repeats and get rid of them in one easy step.

In [105]:
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
5,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9


In [106]:
df.drop_duplicates()

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9


What if we had actually hired two people named Emma to the marketing department on the same day?
    
`.drop_duplicates()` can also target a *single column* for repeated information by using the `subset=` parameter. In this way, items that should be unique — such as "ID" — can be used to determine which rows should be dropped.

In [107]:
df.drop_duplicates(subset="ID")

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
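
Before dropping anything, it can help to look at which rows would be treated as repeats. The sketch below uses `.duplicated()` to flag them and the `keep=` parameter to choose which occurrence survives. The `demo` DataFrame is invented for the example.

```python
import pandas as pd

# Hypothetical data: the last two rows share an ID
demo = pd.DataFrame({
    'Name': ['Alice', 'Emma', 'Emma'],
    'ID': ['1.5', '1.9', '1.9'],
})

# True for the second and later occurrences of an ID
dupe_mask = demo.duplicated(subset='ID')
print(demo[dupe_mask])

# keep='last' retains the final occurrence instead of the first
last_kept = demo.drop_duplicates(subset='ID', keep='last')
print(last_kept)
```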


## 3. Data type conversions

To perform certain operations, data must have the right type. There is no way to find the median of a collection of strings, only of floats or integers.    
In Pandas, the data type of a Series or value is known as its "dtype".

### .astype()
By entering the desired dtype as a parameter, this method can change a Series, or even an entire DataFrame. This is also known as "casting".

In [108]:
df['Performance_Rating'].astype('float')

0    4.5
1    3.8
2    NaN
3    4.0
4    4.7
5    4.7
Name: Performance_Rating, dtype: float64

### pd.to_numeric()
A bit more robust than `.astype()`, `pd.to_numeric()` can handle non-convertible values: with `errors='coerce'`, anything that cannot be parsed as a number is returned as a null (`NaN`).

> **Note:** Unlike the other techniques seen so far in this notebook, this one is not a method attached to `DataFrame` or `Series`. Instead, it is a function contained in the Pandas library. This is why it is called with `pd` in front, rather than `df` or `df['column']`.

In [109]:
pd.to_numeric(df['Salary'], errors='coerce')

0    50000.0
1    60000.0
2        NaN
3    70000.0
4    80000.0
5    80000.0
Name: Salary, dtype: float64
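
To see the difference between the two approaches, the sketch below runs both on a small salary Series containing the same `'45000€'` value. It also shows one possible way to keep that value rather than lose it: stripping the currency symbol first. The Series `salary` is a made-up stand-in for the column above.

```python
import pandas as pd

# Hypothetical stand-in for the Salary column
salary = pd.Series(['50000', '45000€', '70000'])

# .astype() refuses to convert a value it cannot parse
astype_failed = False
try:
    salary.astype('float')
except ValueError as err:
    astype_failed = True
    print('astype failed:', err)

# errors='coerce' turns the unparseable value into NaN instead
coerced = pd.to_numeric(salary, errors='coerce')
print(coerced)

# Removing the symbol first preserves the number
cleaned = pd.to_numeric(salary.str.replace('€', '', regex=False))
print(cleaned)
```

Whether coercing to null or cleaning first is the better choice depends on whether the stray characters carry meaning, such as a different currency.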

## 4. Combining

Sometimes you may wish to turn two DataFrames into one. This can be especially helpful when extracting data from multiple sources but analysing it all together.

### pd.concat()
Another *function* in the Pandas library, `pd.concat()` combines DataFrames one-atop-the-other or side-by-side. Alignments are made on the labels of the specified axis.

In [110]:
skills_data = {
    'Name': ['Alice', 'Bob', 'Claire', 'David', 'Emma', 'Emma'],
    'Skills': ['Python', 'Java', 'SQL', 'Python, C++', 'R, Excel', 'R, Excel']
}
skills_df = pd.DataFrame(skills_data)
skills_df

Unnamed: 0,Name,Skills
0,Alice,Python
1,Bob,Java
2,Claire,SQL
3,David,"Python, C++"
4,Emma,"R, Excel"
5,Emma,"R, Excel"


In [111]:
pd.concat([df,skills_df['Skills']], axis=1)

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,Skills
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5,Python
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,Java
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7,SQL
3,David,,70000,IT,Male,2018-11-05,4.0,1.8,"Python, C++"
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9,"R, Excel"
5,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9,"R, Excel"


A few things to notice above:
- The DataFrames are passed to `pd.concat()` *as a list*
- Only one column of `skills_df` appears, because only that column was called with the indexing operator
- `axis=1` controls the *direction* of concatenation; remember `1` in terms of Pandas axes means column-wise
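
The side-by-side result above lines up neatly because both DataFrames share the index `0` to `5`. The sketch below illustrates what "alignment on labels" means when the indexes differ. Both `left` and `right` are made-up examples.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap
left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Claire']})         # index 0, 1, 2
right = pd.DataFrame({'Skills': ['Java', 'SQL']}, index=[1, 2])   # index 1, 2

# Rows are matched by index label, not by position
combined = pd.concat([left, right], axis=1)
print(combined)
```

Alice has no row labelled `0` in `right`, so her `Skills` value comes out as null, while Bob and Claire are matched by label.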

In [112]:
new_people_data = {
    'Name': ['Grace', 'Henry'],
    'Age': [29, 31],
    'Salary': [60000, 68000],
    'Department': ['Marketing', 'IT'],
    'Gender': ['Female', 'Male'],
    'Hire_Date': ['2023-08-10', '2023-08-15'],
    'Performance_Rating': [4.5, 4.2],
    'ID':['2.0','2.1']
}
new_people_df = pd.DataFrame(new_people_data)
new_people_df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Grace,29,60000,Marketing,Female,2023-08-10,4.5,2.0
1,Henry,31,68000,IT,Male,2023-08-15,4.2,2.1


In [113]:
pd.concat([df, new_people_df], axis=0, ignore_index=True)

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
5,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
6,Grace,29.0,60000,Marketing,Female,2023-08-10,4.5,2.0
7,Henry,31.0,68000,IT,Male,2023-08-15,4.2,2.1


This time, by using `axis=0`, new *rows* are added to `df`.    
`ignore_index=True` stops the indexes in `new_people_df` from carrying over. Otherwise, the resulting index would have been      
`[0, 1, 2, 3, 4, 5, 0, 1]`
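
Repeated index labels are legal in Pandas, but they can be surprising later on. The sketch below, using two made-up frames `top` and `bottom`, shows what a label lookup returns with and without `ignore_index=True`.

```python
import pandas as pd

# Hypothetical frames, each with the default index 0, 1
top = pd.DataFrame({'Name': ['Alice', 'Bob']})
bottom = pd.DataFrame({'Name': ['Grace', 'Henry']})

# Without ignore_index, the original labels are kept
stacked = pd.concat([top, bottom], axis=0)
print(stacked.index.tolist())
print(stacked.loc[0])    # two rows share the label 0

# With ignore_index, the result is renumbered from 0
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```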

## Sorting

### .sort_values()
As you might guess, `.sort_values()` will sort the given column(s) numerically or alphabetically. By default, results will be sorted in ascending order and null values will be sorted last.

In [114]:
df.sort_values('Age')

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
5,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
3,David,,70000,IT,Male,2018-11-05,4.0,1.8


In [115]:
df.sort_values('Age', ascending=False)

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
5,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
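
`.sort_values()` also accepts a list of columns, a matching list for `ascending=`, and a `na_position=` argument to control where nulls go. A minimal sketch on a made-up `demo` DataFrame:

```python
import pandas as pd

# Hypothetical data with a tie on Age and one missing Age
demo = pd.DataFrame({
    'Name': ['Emma', 'Bob', 'David', 'Alice'],
    'Age': [35.0, 30.0, None, 35.0],
})

# Oldest first; ties on Age are broken alphabetically by Name
by_age_then_name = demo.sort_values(['Age', 'Name'], ascending=[False, True])
print(by_age_then_name)

# Put rows with a missing Age at the top instead of the bottom
nulls_first = demo.sort_values('Age', na_position='first')
print(nulls_first)
```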


## Save the changes!

In [116]:
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9
5,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9


⚠️ Notice that not a single change has actually been made to the `df` — there are still duplicates and missing values, and neither skills nor new people have been added.  

In each case, the outputs of the above techniques need to be *explicitly assigned* in order to remain. In other words, to "change" `df`, you need to overwrite it with one of the *new* outputs you've seen.

In [117]:
df = df.drop_duplicates().copy()
df['Performance_Rating'] = df['Performance_Rating'].astype(float)
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9


> **Note:** In general, if you enter some code and a DataFrame appears on your screen, this is a *new object* and would need to be assigned under the same name to "alter the original".
>
> Sometimes, however, what appears on the screen is not new (a 'copy'), but only a view of an object (a 'slice'). In those cases, it is important to save a *copy of the slice* via `.copy()`, as above.
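
The difference between calling a method and assigning its result can be checked directly. This is a small sketch on a made-up two-row `demo` DataFrame; the variable names are chosen only for the example.

```python
import pandas as pd

# Hypothetical data: two identical rows
demo = pd.DataFrame({'Name': ['Emma', 'Emma'], 'ID': ['1.9', '1.9']})

# The result is computed and then thrown away; demo is untouched
demo.drop_duplicates()
len_before = len(demo)
print(len_before)

# Assigning the result back to the same name keeps the change
demo = demo.drop_duplicates()
len_after = len(demo)
print(len_after)
```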

## 5. Dropping and creating columns

There are times when a DataFrame contains data which is not needed. At other times, it needs additional information.

### .drop()
Used to eliminate rows or columns according to their label(s). Again, `axis` makes an appearance to specify in which direction to operate.

In [118]:
df.drop('ID',axis=1)

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8
2,Claire,22.0,45000€,Finance,Female,2021-01-10,
3,David,,70000,IT,Male,2018-11-05,4.0
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7


In [119]:
df.drop(4,axis=0)

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7
3,David,,70000,IT,Male,2018-11-05,4.0,1.8
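
If keeping track of axis numbers feels error-prone, `.drop()` also accepts the keyword arguments `columns=` and `index=`, which state the direction by name. The sketch below uses a made-up `demo` DataFrame; `'Nickname'` is a deliberately non-existent column used to show `errors='ignore'`.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'ID': ['1.5', '1.6']})

# Same effect as demo.drop('ID', axis=1)
no_id = demo.drop(columns=['ID'])

# Same effect as demo.drop(0, axis=0)
no_first_row = demo.drop(index=[0])

# errors='ignore' skips labels that do not exist instead of raising an error
tolerant = demo.drop(columns=['ID', 'Nickname'], errors='ignore')

print(no_id)
print(no_first_row)
print(tolerant)
```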


### Create a column
You can create a new column by simply assigning it a value.

In [120]:
df['new_column'] = 'some_value'
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,new_column
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5,some_value
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,some_value
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7,some_value
3,David,,70000,IT,Male,2018-11-05,4.0,1.8,some_value
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9,some_value


A new column can also be derived from one or more existing columns.

In [121]:
df.loc[:, 'another_column'] = df['Gender'] + ' ' + df['Department']
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,new_column,another_column
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5,some_value,Female HR
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,some_value,
2,Claire,22.0,45000€,Finance,Female,2021-01-10,,1.7,some_value,Female Finance
3,David,,70000,IT,Male,2018-11-05,4.0,1.8,some_value,Male IT
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9,some_value,Female Marketing


## 6. Conditional updating

There are times when values need to be updated, but only if they meet certain criteria.

### Using `.loc`
Since `.loc` allows you to filter conditionally, it is already well suited to updating values via assignment.    
`df.loc[condition, column] = new_value`

Someone has noticed that Claire's age was entered incorrectly! Let's correct it:

In [122]:
name_mask = df['Name'] == "Claire"
df.loc[name_mask,['Age']]

Unnamed: 0,Age
2,22.0


In this first step, we are able to view Claire's age. New values can be assigned via `=`; the next cell updates her salary along with her age.

In [123]:
df.loc[name_mask,['Age', 'Salary']] = [27, 10000]
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,new_column,another_column
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5,some_value,Female HR
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,some_value,
2,Claire,27.0,10000,Finance,Female,2021-01-10,,1.7,some_value,Female Finance
3,David,,70000,IT,Male,2018-11-05,4.0,1.8,some_value,Male IT
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9,some_value,Female Marketing


Several values meeting the same condition can also be updated at the same time.    
Let's change the 'Marketing' department to 'Public Relations'.

In [124]:
marketing_mask = df['Department'] == "Marketing"
df.loc[marketing_mask, 'Department'] = "Public Relations"
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,new_column,another_column
0,Alice,25.0,50000,HR,Female,2020-03-15,4.5,1.5,some_value,Female HR
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,some_value,
2,Claire,27.0,10000,Finance,Female,2021-01-10,,1.7,some_value,Female Finance
3,David,,70000,IT,Male,2018-11-05,4.0,1.8,some_value,Male IT
4,Emma,35.0,80000,Public Relations,Female,2022-02-28,4.7,1.9,some_value,Female Marketing


It is also possible to use multiple conditions when filtering and updating data.     
Let's give a raise to everyone who is under 29 and works in Finance or HR.

In [125]:
age_mask = df['Age'] < 29
department_mask = df['Department'].isin(['Finance', 'HR'])

df.loc[age_mask & department_mask, "Salary"] = 55000
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,new_column,another_column
0,Alice,25.0,55000,HR,Female,2020-03-15,4.5,1.5,some_value,Female HR
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,some_value,
2,Claire,27.0,55000,Finance,Female,2021-01-10,,1.7,some_value,Female Finance
3,David,,70000,IT,Male,2018-11-05,4.0,1.8,some_value,Male IT
4,Emma,35.0,80000,Public Relations,Female,2022-02-28,4.7,1.9,some_value,Female Marketing


> **Note:** One reason to assign new values with `.loc` is that it always updates the original DataFrame. Other methods may modify only a copy of the data, leading to unpredictable outcomes or lost work.
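
Masks can be combined with `&` (and), `|` (or) and `~` (not). When the comparisons are written inline rather than stored in variables first, each one needs its own parentheses, because `&` and `|` bind more tightly than `<` or `==` in Python. A minimal sketch on a made-up `demo` DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({
    'Age': [25, 30, 27, 40],
    'Department': ['HR', 'IT', 'Finance', 'HR'],
})

# Under 29 OR in IT: each comparison is wrapped in parentheses
young_or_it = demo[(demo['Age'] < 29) | (demo['Department'] == 'IT')]
print(young_or_it)

# NOT in HR AND under 35
not_hr_under_35 = demo[~(demo['Department'] == 'HR') & (demo['Age'] < 35)]
print(not_hr_under_35)
```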

### .where()
Sometimes instead of looking to *meet* a condition, you may wish to update values that *fail to meet* your criteria.

In [126]:
df

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,new_column,another_column
0,Alice,25.0,55000,HR,Female,2020-03-15,4.5,1.5,some_value,Female HR
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,some_value,
2,Claire,27.0,55000,Finance,Female,2021-01-10,,1.7,some_value,Female Finance
3,David,,70000,IT,Male,2018-11-05,4.0,1.8,some_value,Male IT
4,Emma,35.0,80000,Public Relations,Female,2022-02-28,4.7,1.9,some_value,Female Marketing


Let's check our employees' performance ratings and give a salary of only 30000 to anyone doing worse than a 4:

In [127]:
df['Salary'].where(df['Performance_Rating'] >= 4, 30000)

0    55000
1    30000
2    30000
3    70000
4    80000
Name: Salary, dtype: object

This is functionally equivalent to writing     
`df.loc[~(df['Performance_Rating'] >= 4), 'Salary'] = 30000`      
except that `.where()` requires you to explicitly *save* the result, whereas assignment is already written in with `.loc`.

In [128]:
example_df = df.copy()
example_df.loc[~(example_df['Performance_Rating'] >= 4), 'Salary'] = 30000
example_df['Salary']

0    55000
1    30000
2    30000
3    70000
4    80000
Name: Salary, dtype: object

Since the logic of changing a value where a condition is *not met* might feel a bit strange, you can always flip the logic by negating the condition with `~`.

In [129]:
df['Salary'].where(~(df['Performance_Rating'] < 4), 30000)

0    55000
1    30000
2    55000
3    70000
4    80000
Name: Salary, dtype: object
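
Pandas also provides `.mask()`, which is the mirror image of `.where()`: it replaces values where the condition is *met*. The sketch below compares the two on a made-up `demo` DataFrame that contains no nulls, so the two calls agree.

```python
import pandas as pd

# Hypothetical data without missing ratings
demo = pd.DataFrame({
    'Salary': [55000, 60000, 70000],
    'Performance_Rating': [4.5, 3.8, 4.0],
})

low_rating = demo['Performance_Rating'] < 4

# .mask() replaces where the condition is True
with_mask = demo['Salary'].mask(low_rating, 30000)

# .where() replaces where the condition is False, so the condition is negated
with_where = demo['Salary'].where(~low_rating, 30000)

print(with_mask)
print(with_where)
```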

> ⚠️ **Watch out for null values!**
>
> Null values behave in particular ways that need to be accounted for when updating values conditionally:
> - Comparisons with a null value (`==`, `<`, `>`, `<=`, `>=`) evaluate to `False`
> - The exception is `!=`, which evaluates to `True`
> - Negating with `~` flips **either** of the above results
>
> Notice how the salary in row 2 is altered by using `.where()` and `.loc[~ ]`, but is **not altered** when using `.where(~ )`.
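
These rules can be checked on a small Series. The sketch below uses a made-up `rating` Series with one missing value and prints the result of each kind of comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; the last one is missing
rating = pd.Series([4.5, 3.8, np.nan])

at_least_4 = (rating >= 4).tolist()       # the null compares as False
below_4 = (rating < 4).tolist()           # the null compares as False here too
not_equal_4 = (rating != 4).tolist()      # != is the exception: True for the null
not_below_4 = (~(rating < 4)).tolist()    # negation flips the null's False to True

print(at_least_4)     # [True, False, False]
print(below_4)        # [False, True, False]
print(not_equal_4)    # [True, True, True]
print(not_below_4)    # [True, False, True]
```

The null row fails both `rating >= 4` and `rating < 4`, which is why two conditions that look like opposites can treat it differently.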

### .replace()
Offering even more options and flexibility, `.replace()` is capable of updating values across multiple columns and assigning different results to different matches, as well as handling more advanced "pattern matching".  

Imagine all our departments changed their names to something else. Instead of writing as many conditions as there are replacements, we can use `.replace()` in a variety of ways. Here is an example using a dictionary:

In [130]:
replace_dict = {'HR': 'Human Resources',
                'Finance':'Accounting',
                'IT':'Tech',
                'Public Relations':'Marketing'}

df.replace(replace_dict)

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,new_column,another_column
0,Alice,25.0,55000,Human Resources,Female,2020-03-15,4.5,1.5,some_value,Female HR
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,some_value,
2,Claire,27.0,55000,Accounting,Female,2021-01-10,,1.7,some_value,Female Finance
3,David,,70000,Tech,Male,2018-11-05,4.0,1.8,some_value,Male IT
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9,some_value,Female Marketing


You can direct the changes to a single column as well:

In [131]:
df['Department'].replace(replace_dict)

0    Human Resources
1               None
2         Accounting
3               Tech
4          Marketing
Name: Department, dtype: object

Pattern matching can be useful when trying to replace information in multiple columns with slightly differing values. This is done with the argument `regex=True`. Notice the code below changes values in both `Department` and `another_column`.

In [132]:
replace_dict = {'HR': 'Human Resources',
                'Finance':'Accounting',
                'IT':'Tech',
                'Public Relations':'Marketing'}
df.replace(replace_dict, regex=True)

Unnamed: 0,Name,Age,Salary,Department,Gender,Hire_Date,Performance_Rating,ID,new_column,another_column
0,Alice,25.0,55000,Human Resources,Female,2020-03-15,4.5,1.5,some_value,Female Human Resources
1,Bob,30.0,60000,,Male,2019-05-20,3.8,1.6,some_value,
2,Claire,27.0,55000,Accounting,Female,2021-01-10,,1.7,some_value,Female Accounting
3,David,,70000,Tech,Male,2018-11-05,4.0,1.8,some_value,Male Tech
4,Emma,35.0,80000,Marketing,Female,2022-02-28,4.7,1.9,some_value,Female Marketing
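
With `regex=True`, a key matches anywhere inside a string, which can change more than intended. The sketch below shows this on a made-up `demo` DataFrame (the value `'AUDIT'` is invented for the example) and one way to restrict a pattern to whole-cell matches using the regex anchors `^` and `$`.

```python
import pandas as pd

# Hypothetical data: 'AUDIT' happens to contain the letters 'IT'
demo = pd.DataFrame({'Department': ['IT', 'AUDIT']})

# Unanchored pattern: the 'IT' inside 'AUDIT' is replaced as well
loose = demo.replace({'IT': 'Tech'}, regex=True)
print(loose)

# Anchored pattern: only cells that are exactly 'IT' are replaced
strict = demo.replace({r'^IT$': 'Tech'}, regex=True)
print(strict)
```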


## Challenges

In [133]:
data = {
    'City': ['New York', 'New York', 'Los Angeles', 'Chicago', 'Houston', 'San Francisco', 'London', 'Paris', 'Tokyo', 'Sydney', 'Toronto'],
    'Population': [8622698, 8622698, 3990456, None, 2312717, 870887, 8982000, 2140526, None, '2079700', 2930000],
    'Area_sq_miles': [302.6, 302.6, 468.7, 227.3, 627.8, 46.9, 607, None, 845.8, 2058.4, 630.2],
    'Country': ['USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'United Kingdom', 'France', 'Japan', 'Australia', 'Canada'],
    'Year_Founded': [1624, 1624, 1781, 1833, 1836, None, 43, 52, 660, 1788, None]
}

cities_df = pd.DataFrame(data)

cities_df

Unnamed: 0,City,Population,Area_sq_miles,Country,Year_Founded
0,New York,8622698.0,302.6,USA,1624.0
1,New York,8622698.0,302.6,USA,1624.0
2,Los Angeles,3990456.0,468.7,USA,1781.0
3,Chicago,,227.3,USA,1833.0
4,Houston,2312717.0,627.8,USA,1836.0
5,San Francisco,870887.0,46.9,USA,
6,London,8982000.0,607.0,United Kingdom,43.0
7,Paris,2140526.0,,France,52.0
8,Tokyo,,845.8,Japan,660.0
9,Sydney,2079700.0,2058.4,Australia,1788.0


### Challenge 1
Convert the 'Population' column to numeric


In [134]:
# Your code here
cities_df['Population'] = cities_df['Population'].astype('float')
cities_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   City           11 non-null     object 
 1   Population     9 non-null      float64
 2   Area_sq_miles  10 non-null     float64
 3   Country        11 non-null     object 
 4   Year_Founded   9 non-null      float64
dtypes: float64(3), object(2)
memory usage: 572.0+ bytes


### Challenge 2
Fill missing area values with the median area



In [135]:
# Your code here
area_median = cities_df['Area_sq_miles'].median()
# Alternative: cities_df['Area_sq_miles'] = cities_df['Area_sq_miles'].fillna(area_median)

# Fill only the rows where the area is missing, reusing the median computed above
cities_df.loc[cities_df['Area_sq_miles'].isnull(), 'Area_sq_miles'] = area_median
cities_df

Unnamed: 0,City,Population,Area_sq_miles,Country,Year_Founded
0,New York,8622698.0,302.6,USA,1624.0
1,New York,8622698.0,302.6,USA,1624.0
2,Los Angeles,3990456.0,468.7,USA,1781.0
3,Chicago,,227.3,USA,1833.0
4,Houston,2312717.0,627.8,USA,1836.0
5,San Francisco,870887.0,46.9,USA,
6,London,8982000.0,607.0,United Kingdom,43.0
7,Paris,2140526.0,537.85,France,52.0
8,Tokyo,,845.8,Japan,660.0
9,Sydney,2079700.0,2058.4,Australia,1788.0


### Challenge 3
Remove duplicates from the DataFrame


In [136]:
# Your code here
cities_df = cities_df.drop_duplicates()
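# The index keeps its original labels (label 1 is now missing); .reset_index(drop=True) would renumber it
cities_df
# Note: the median in Challenge 2 was computed while the duplicate New York row was still present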

Unnamed: 0,City,Population,Area_sq_miles,Country,Year_Founded
0,New York,8622698.0,302.6,USA,1624.0
2,Los Angeles,3990456.0,468.7,USA,1781.0
3,Chicago,,227.3,USA,1833.0
4,Houston,2312717.0,627.8,USA,1836.0
5,San Francisco,870887.0,46.9,USA,
6,London,8982000.0,607.0,United Kingdom,43.0
7,Paris,2140526.0,537.85,France,52.0
8,Tokyo,,845.8,Japan,660.0
9,Sydney,2079700.0,2058.4,Australia,1788.0
10,Toronto,2930000.0,630.2,Canada,


### Challenge 4
Fill missing population values with the mean population


In [137]:
# Your code here
print(cities_df)
population_mean = cities_df['Population'].mean()
# None and NaN both count as null, so .isna() (alias: .isnull()) catches either one
cities_df.loc[cities_df['Population'].isna(), 'Population'] = population_mean
print('\n', cities_df)

             City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8622698.0         302.60             USA        1624.0
2     Los Angeles   3990456.0         468.70             USA        1781.0
3         Chicago         NaN         227.30             USA        1833.0
4         Houston   2312717.0         627.80             USA        1836.0
5   San Francisco    870887.0          46.90             USA           NaN
6          London   8982000.0         607.00  United Kingdom          43.0
7           Paris   2140526.0         537.85          France          52.0
8           Tokyo         NaN         845.80           Japan         660.0
9          Sydney   2079700.0        2058.40       Australia        1788.0
10        Toronto   2930000.0         630.20          Canada           NaN

              City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8622698.0         302.60             USA        1624.0
2     Los Angeles   399

### Challenge 5
Update the 'Year_Founded' to 1800 for cities with missing year values

In [138]:
# Your code here
print(cities_df)
cities_df.loc[cities_df['Year_Founded'].isna(), 'Year_Founded'] = 1800  # or .isnull()
print('\n', cities_df)

             City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8622698.0         302.60             USA        1624.0
2     Los Angeles   3990456.0         468.70             USA        1781.0
3         Chicago   3991123.0         227.30             USA        1833.0
4         Houston   2312717.0         627.80             USA        1836.0
5   San Francisco    870887.0          46.90             USA           NaN
6          London   8982000.0         607.00  United Kingdom          43.0
7           Paris   2140526.0         537.85          France          52.0
8           Tokyo   3991123.0         845.80           Japan         660.0
9          Sydney   2079700.0        2058.40       Australia        1788.0
10        Toronto   2930000.0         630.20          Canada           NaN

              City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8622698.0         302.60             USA        1624.0
2     Los Angeles   399

### Challenge 6
Update the 'Year_Founded' to 1800 for cities in the USA founded before 1800


In [139]:
# Your code here
print(cities_df)
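# Note: this condition matches every city founded before 1800, regardless of country.
# To limit it to the USA as the challenge asks, combine it with (cities_df['Country'] == 'USA') using &
cities_df.loc[cities_df['Year_Founded'] < 1800, 'Year_Founded'] = 1800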
print('\n', cities_df)

             City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8622698.0         302.60             USA        1624.0
2     Los Angeles   3990456.0         468.70             USA        1781.0
3         Chicago   3991123.0         227.30             USA        1833.0
4         Houston   2312717.0         627.80             USA        1836.0
5   San Francisco    870887.0          46.90             USA        1800.0
6          London   8982000.0         607.00  United Kingdom          43.0
7           Paris   2140526.0         537.85          France          52.0
8           Tokyo   3991123.0         845.80           Japan         660.0
9          Sydney   2079700.0        2058.40       Australia        1788.0
10        Toronto   2930000.0         630.20          Canada        1800.0

              City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8622698.0         302.60             USA        1800.0
2     Los Angeles   399

### Challenge 7
Set the 'Population' to 8 million for cities with a population greater than 8 million

In [140]:
# Your code here
print(cities_df)
cities_df.loc[cities_df['Population'] > 8000000, 'Population'] = 8000000
print('\n', cities_df)

             City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8622698.0         302.60             USA        1800.0
2     Los Angeles   3990456.0         468.70             USA        1800.0
3         Chicago   3991123.0         227.30             USA        1833.0
4         Houston   2312717.0         627.80             USA        1836.0
5   San Francisco    870887.0          46.90             USA        1800.0
6          London   8982000.0         607.00  United Kingdom        1800.0
7           Paris   2140526.0         537.85          France        1800.0
8           Tokyo   3991123.0         845.80           Japan        1800.0
9          Sydney   2079700.0        2058.40       Australia        1800.0
10        Toronto   2930000.0         630.20          Canada        1800.0

              City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8000000.0         302.60             USA        1800.0
2     Los Angeles   399

### Challenge 8
Use .replace() to update 'Country' names 'United Kingdom' to 'UK' and 'Canada' to 'CAN'


In [141]:
# Your code here
print(cities_df)
# The challenge asks only for 'UK' and 'CAN'; the other three abbreviations are extras
country_codes = {'United Kingdom': 'UK', 'Canada': 'CAN', 'France': 'FRA',
                 'Japan': 'JAP', 'Australia': 'AUS'}
cities_df['Country'] = cities_df['Country'].replace(country_codes)
print(cities_df)

             City  Population  Area_sq_miles         Country  Year_Founded
0        New York   8000000.0         302.60             USA        1800.0
2     Los Angeles   3990456.0         468.70             USA        1800.0
3         Chicago   3991123.0         227.30             USA        1833.0
4         Houston   2312717.0         627.80             USA        1836.0
5   San Francisco    870887.0          46.90             USA        1800.0
6          London   8000000.0         607.00  United Kingdom        1800.0
7           Paris   2140526.0         537.85          France        1800.0
8           Tokyo   3991123.0         845.80           Japan        1800.0
9          Sydney   2079700.0        2058.40       Australia        1800.0
10        Toronto   2930000.0         630.20          Canada        1800.0
             City  Population  Area_sq_miles Country  Year_Founded
0        New York   8000000.0         302.60     USA        1800.0
2     Los Angeles   3990456.0         468

### Challenge 9
Concatenate 'City' and 'Country' columns into a new 'Location' column


In [142]:
# Your code here
print(cities_df)

# String concatenation with + works element-wise on the two columns
cities_df['Location'] = cities_df['City'] + ', ' + cities_df['Country']
print('\n', cities_df)

             City  Population  Area_sq_miles Country  Year_Founded
0        New York   8000000.0         302.60     USA        1800.0
2     Los Angeles   3990456.0         468.70     USA        1800.0
3         Chicago   3991123.0         227.30     USA        1833.0
4         Houston   2312717.0         627.80     USA        1836.0
5   San Francisco    870887.0          46.90     USA        1800.0
6          London   8000000.0         607.00      UK        1800.0
7           Paris   2140526.0         537.85     FRA        1800.0
8           Tokyo   3991123.0         845.80     JAP        1800.0
9          Sydney   2079700.0        2058.40     AUS        1800.0
10        Toronto   2930000.0         630.20     CAN        1800.0

              City  Population  Area_sq_miles Country  Year_Founded  \
0        New York   8000000.0         302.60     USA        1800.0   
2     Los Angeles   3990456.0         468.70     USA        1800.0   
3         Chicago   3991123.0         227.30     US

### Challenge 10
Drop the 'Location' column

In [143]:
# Your code here
print(cities_df)
cities_df = cities_df.drop('Location', axis = 1)
print('\n', cities_df)

             City  Population  Area_sq_miles Country  Year_Founded  \
0        New York   8000000.0         302.60     USA        1800.0   
2     Los Angeles   3990456.0         468.70     USA        1800.0   
3         Chicago   3991123.0         227.30     USA        1833.0   
4         Houston   2312717.0         627.80     USA        1836.0   
5   San Francisco    870887.0          46.90     USA        1800.0   
6          London   8000000.0         607.00      UK        1800.0   
7           Paris   2140526.0         537.85     FRA        1800.0   
8           Tokyo   3991123.0         845.80     JAP        1800.0   
9          Sydney   2079700.0        2058.40     AUS        1800.0   
10        Toronto   2930000.0         630.20     CAN        1800.0   

              Location  
0        New York, USA  
2     Los Angeles, USA  
3         Chicago, USA  
4         Houston, USA  
5   San Francisco, USA  
6           London, UK  
7           Paris, FRA  
8           Tokyo, JAP  
9   

### Challenge 11
Create a new column 'Size' and assign the value 'unknown' to all rows


In [144]:
# Your code here
cities_df['Size'] = 'unknown'
cities_df

Unnamed: 0,City,Population,Area_sq_miles,Country,Year_Founded,Size
0,New York,8000000.0,302.6,USA,1800.0,unknown
2,Los Angeles,3990456.0,468.7,USA,1800.0,unknown
3,Chicago,3991123.0,227.3,USA,1833.0,unknown
4,Houston,2312717.0,627.8,USA,1836.0,unknown
5,San Francisco,870887.0,46.9,USA,1800.0,unknown
6,London,8000000.0,607.0,UK,1800.0,unknown
7,Paris,2140526.0,537.85,FRA,1800.0,unknown
8,Tokyo,3991123.0,845.8,JAP,1800.0,unknown
9,Sydney,2079700.0,2058.4,AUS,1800.0,unknown
10,Toronto,2930000.0,630.2,CAN,1800.0,unknown


### Challenge 12
Change Size to 'Small' for all cities with 'Area_sq_miles' less than 400.  
Change Size to 'Big' for all cities with 'Area_sq_miles' bigger than 400.

In [145]:
# Your code here
cities_df.loc[cities_df['Area_sq_miles'] < 400, 'Size'] = 'Small'
cities_df.loc[cities_df['Area_sq_miles'] > 400, 'Size'] = 'Big'
# A city with an area of exactly 400 (or a missing area) would keep the value 'unknown'
cities_df

Unnamed: 0,City,Population,Area_sq_miles,Country,Year_Founded,Size
0,New York,8000000.0,302.6,USA,1800.0,Small
2,Los Angeles,3990456.0,468.7,USA,1800.0,Big
3,Chicago,3991123.0,227.3,USA,1833.0,Small
4,Houston,2312717.0,627.8,USA,1836.0,Big
5,San Francisco,870887.0,46.9,USA,1800.0,Small
6,London,8000000.0,607.0,UK,1800.0,Big
7,Paris,2140526.0,537.85,FRA,1800.0,Big
8,Tokyo,3991123.0,845.8,JAP,1800.0,Big
9,Sydney,2079700.0,2058.4,AUS,1800.0,Big
10,Toronto,2930000.0,630.2,CAN,1800.0,Big
