In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

# Cleaning data in Python


In [None]:
air_df = pd.read_csv('https://assets.datacamp.com/production/course_2023/datasets/airquality.csv')

In [None]:
air_df.head(5)

In [None]:
dob_df = pd.read_csv('https://assets.datacamp.com/production/course_2023/datasets/dob_job_application_filings_subset.csv')

In [None]:
dob_df.head()

 - Create a histogram of the 'Existing Zoning Sqft' column. Rotate the axis labels by 70 degrees and use a log scale for both axes.
 - Display the histogram using plt.show().

In [None]:
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot the histogram
dob_df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
plt.show()

While visualizing your data is a great way to understand it, keep in mind that no single technique is better than another. As you saw here, you still needed the summary statistics to understand your data better. You expected a large number of counts on the left side of the plot because the 25th, 50th, and 75th percentiles are all 0. The plot shows that there are barely any counts near the maximum value, which suggests an outlier.
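As a minimal sketch of the summary-statistics check described above, the snippet below uses a small made-up Series (the values are illustrative and are not taken from the DOB dataset) to show how quartiles of 0 next to a very large maximum hint at an outlier.

```python
import pandas as pd

# Hypothetical values: mostly zeros plus one extreme entry
sqft = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 10, 500000], name='sqft')

# describe() reports count, mean, std, min, the quartiles, and max
summary = sqft.describe()
print(summary)

# The three quartiles are all 0 while the max is far larger
print(summary['25%'], summary['50%'], summary['75%'], summary['max'])
```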

In [None]:
# Create the boxplot
dob_df.boxplot(column='initial_cost', by='Borough', rot=90)

# Display the plot
plt.show()

## Reshaping your data using melt
Melting data is the process of turning columns of your data into rows of data. Consider the DataFrames from the previous exercise. In the tidy DataFrame, the variables Ozone, Solar.R, Wind, and Temp each had their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: Depending on how your data is represented, you will have to reshape it differently.

In this exercise, you will practice melting a DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.
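To make the default behavior concrete, here is a minimal sketch on a made-up two-row frame (the column values are illustrative only). It assumes nothing beyond `pd.melt()` itself and shows that omitting value_vars melts every column not named in id_vars.

```python
import pandas as pd

# Hypothetical two-row frame shaped like the air quality data
df = pd.DataFrame({
    'Month': [5, 5],
    'Day': [1, 2],
    'Ozone': [41.0, 36.0],
    'Wind': [7.4, 8.0],
})

# Explicit value_vars
explicit = pd.melt(df, id_vars=['Month', 'Day'], value_vars=['Ozone', 'Wind'])

# Omitting value_vars melts every column not listed in id_vars
implicit = pd.melt(df, id_vars=['Month', 'Day'])

print(explicit)
print(implicit)
```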

The (tidy) DataFrame airquality has been pre-loaded. Your job is to melt its Ozone, Solar.R, Wind, and Temp columns into rows. Later in this chapter, you'll learn how to bring this melted DataFrame back into a tidy form.

 - Print the head of airquality.
 - Use pd.melt() to melt the Ozone, Solar.R, Wind, and Temp columns of airquality into rows. Do this by using id_vars to specify the columns you do not wish to melt: 'Month' and 'Day'.
 - Print the head of airquality_melt.

In [None]:
air_df.head()

In [None]:
airquality_melt = pd.melt(air_df, id_vars=['Month','Day'], value_vars=['Ozone', 'Solar.R', 'Wind', 'Temp'])

In [None]:
airquality_melt.head()

## Customizing melted data
When melting DataFrames, it would be better to have column names more meaningful than variable and value.

The default names may work in certain situations, but it's best to always have data that is self-explanatory.

You can rename the variable column by specifying an argument to the var_name parameter, and the value column by specifying an argument to the value_name parameter. You will now practice doing exactly this. 

In [None]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(air_df, id_vars=['Month', 'Day'], 
                value_vars=['Ozone', 'Solar.R', 'Wind','Temp'], 
                var_name= 'measurement', value_name='reading')

In [None]:
airquality_melt.head(5)

## Pivot data
Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame was in before you melted it? You'll now begin pivoting it back into that form using the .pivot_table() method!

While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column.

.pivot_table() has an index parameter which you can use to specify the columns that you don't want pivoted: It is similar to the id_vars parameter of pd.melt(). Two other parameters that you have to specify are columns (the name of the column you want to pivot), and values (the values to be used when the column is pivoted). 
 - Pivot airquality_melt by using .pivot_table() with the rows indexed by 'Month' and 'Day', the columns indexed by 'measurement', and the values populated with 'reading'.

In [None]:
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

In [None]:
airquality_pivot.head()

## Resetting the index of a DataFrame
After pivoting airquality_melt in the previous exercise, you didn't quite get back the original DataFrame.

What we got back instead was a pandas DataFrame with a hierarchical index (also known as a MultiIndex).
There's a very simple method you can use to get back the original DataFrame from the pivoted DataFrame: .reset_index()

In [None]:
airquality_pivot.index

In [None]:
# Reset the index of airquality_pivot: airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

In [None]:
airquality_pivot.index

In [None]:
airquality_pivot.head()
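The index change can be seen on a tiny example. This is a sketch on made-up data, not the airquality file; it also shows one optional cleanup step (clearing the leftover name on the columns axis), which is an addition and not part of the original exercise.

```python
import pandas as pd

# Hypothetical melted data: one row per (Month, Day, measurement)
melted = pd.DataFrame({
    'Month': [5, 5, 5, 5],
    'Day': [1, 1, 2, 2],
    'measurement': ['Ozone', 'Wind', 'Ozone', 'Wind'],
    'reading': [41.0, 7.4, 36.0, 8.0],
})

pivoted = melted.pivot_table(index=['Month', 'Day'],
                             columns='measurement', values='reading')
print(type(pivoted.index).__name__)   # MultiIndex

flat = pivoted.reset_index()
print(type(flat.index).__name__)      # RangeIndex

# The columns axis still carries the name 'measurement'; clear it if unwanted
flat.columns.name = None
print(flat)
```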

## Pivoting duplicate values
So far, you've used the .pivot_table() method when there are multiple index values you want to hold constant during a pivot. You can also use pivot tables to deal with duplicate values by providing an aggregation function through the aggfunc parameter. Here, you're going to combine both these uses of pivot tables.

Let's say your data collection method accidentally duplicated your dataset. By using .pivot_table() and the aggfunc parameter, you can not only reshape your data, but also remove duplicates. Finally, you can then flatten the columns of the pivoted DataFrame using .reset_index().
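The following sketch illustrates the idea on made-up data in which every row appears twice. It contrasts `.pivot()`, which raises an error on repeated index/column pairs, with `.pivot_table()`, which aggregates them. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical melted data, then duplicated in full
melted = pd.DataFrame({
    'Month': [5, 5],
    'Day': [1, 1],
    'measurement': ['Ozone', 'Wind'],
    'reading': [41.0, 7.4],
})
duplicated = pd.concat([melted, melted], ignore_index=True)

# .pivot() refuses to reshape when index/column pairs repeat
try:
    duplicated.pivot(index='Day', columns='measurement', values='reading')
except ValueError as err:
    print('pivot failed:', err)

# .pivot_table() aggregates the repeats; the mean of identical values
# is the value itself, so the duplicates collapse into one row
deduped = duplicated.pivot_table(index=['Month', 'Day'],
                                 columns='measurement',
                                 values='reading',
                                 aggfunc='mean').reset_index()
print(deduped)
```

Note that averaging only removes duplicates cleanly when the repeated rows are identical; if they differ, the result is a genuine average.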

In [None]:
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc='mean')

In [None]:
airquality_pivot = airquality_pivot.reset_index()

In [None]:
airquality_pivot.head()

## Splitting a column with .str
The tb dataset consisting of case counts of tuberculosis by country, year, gender, and age group, has been pre-loaded into a DataFrame as tb.

In this exercise, you're going to tidy the 'm014' column, which represents males aged 0-14 years. In order to parse this value, you need to extract the first letter into a new column for gender, and the rest into a column for age_group. Here, since you can parse values by position, you can take advantage of pandas' vectorized string slicing by using the str attribute of columns of type object.

## Instructions

- Melt tb keeping 'country' and 'year' fixed.
- Create a 'gender' column by slicing the first letter of the variable column of tb_melt.
- Create an 'age_group' column by slicing the rest of the variable column of tb_melt.
- Print the head of tb_melt.


In [None]:
tb = pd.read_csv('https://assets.datacamp.com/production/course_2023/datasets/tb.csv')

In [None]:
# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

In [None]:
tb_melt.head()

## Splitting a column with .split() and .get()
Another common way multiple variables are stored in columns is with a delimiter. 
This time, you cannot slice the variable by position as in the previous exercise. Instead, you need Python's built-in string method .split(). By default, this method splits a string on whitespace. In this case, however, you want it to split on an underscore. For example, 'Cases_Guinea'.split('_') returns the list ['Cases', 'Guinea'].

The next challenge is to extract the first element of this list and assign it to a type variable, and the second element of the list to a country variable. You can accomplish this by accessing the str attribute of the column and using the .get() method to retrieve the 0 or 1 index, depending on the part you want.
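A short sketch of both steps, first on a plain Python string and then on a pandas Series. The two labels below are illustrative stand-ins for the melted column values.

```python
import pandas as pd

# Plain Python: split one string on the underscore
parts = 'Cases_Guinea'.split('_')
print(parts)  # ['Cases', 'Guinea']

# pandas: the same idea applied element-wise through the .str accessor
labels = pd.Series(['Cases_Guinea', 'Deaths_Liberia'])
split = labels.str.split('_')

# .str.get(i) pulls element i out of each list
print(split.str.get(0).tolist())  # ['Cases', 'Deaths']
print(split.str.get(1).tolist())  # ['Guinea', 'Liberia']
```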
## Instructions

- Melt ebola using 'Date' and 'Day' as the id_vars, 'type_country' as the var_name, and 'counts' as the value_name.
- Create a column called 'str_split' by splitting the 'type_country' column of ebola_melt on '_'. Note that you will first have to access the str attribute of type_country before you can use .split().
- Create a column called 'type' by using the .get() method to retrieve index 0 of the 'str_split' column of ebola_melt.
- Create a column called 'country' by using the .get() method to retrieve index 1 of the 'str_split' column of ebola_melt.
- Print the head of ebola_melt.

In [None]:
ebola = pd.read_csv('https://assets.datacamp.com/production/course_2023/datasets/ebola.csv')

In [None]:
ebola.columns

In [None]:
ebola.head(5)

In [None]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date','Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')

In [None]:
ebola_melt.head()

In [None]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

In [None]:
ebola_melt.head()

# Combining data for analysis
The ability to transform and combine your data is a crucial skill in data science, because your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. Or if you are dealing with time series data, for example, you may have a new dataset for each day. No matter the reason, it is important to be able to combine datasets so you can either clean a single dataset, or clean each dataset separately and then combine them later so you can run your analysis on a single dataset. In this chapter, you'll learn all about combining data.

## Combining columns of data
Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation.
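The difference between the two axes is easiest to see on two tiny frames. This is a sketch with made-up column names `a` and `b`.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'b': [3, 4]})

# axis=0 (default): rows are stacked, and cells for absent columns become NaN
stacked = pd.concat([left, right])

# axis=1: frames are placed side by side, aligned on the row index
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)        # (4, 2)
print(side_by_side.shape)   # (2, 2)
```

Because axis=1 aligns on the index, both inputs should share the same row labels; otherwise the result contains NaN where labels do not match.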



In [None]:
# create a new dataframe 
status_country = ebola_melt[['type','country']]
status_country.columns = ['status','country']
status_country.head()

In [None]:
# copy ebola_melt
melt_ebola = ebola_melt.copy(deep=True)

In [None]:
# drop type and country columns
melt_ebola = melt_ebola.drop(columns=['type','country'])
melt_ebola.head()

In [None]:
# Concatenate melt_ebola and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([melt_ebola, status_country], axis=1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)


In [None]:

# Print the head of ebola_tidy
ebola_tidy.head()


# Finding and concatenating data
## Finding files that match a pattern
We're now going to practice using the glob module to find all csv files in the workspace. In the next exercise, we'll programmatically load them into DataFrames.

The glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if you know the file names consist of part_, then a single digit, then .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.).

Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'. The ? wildcard represents any 1 character, and the * wildcard represents any number of characters.
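To try the wildcards without touching your own files, the sketch below creates a few empty, hypothetical files in a temporary directory and matches them with each pattern. The file names and the `match` helper are made up for illustration.

```python
import glob
import os
import tempfile

# Create a few empty, hypothetical files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ['part_1.csv', 'part_2.csv', 'part_10.csv', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    def match(pattern):
        # Return just the file names, sorted, for readability
        return sorted(os.path.basename(p)
                      for p in glob.glob(os.path.join(tmp, pattern)))

    single_digit = match('part_?.csv')   # ? matches exactly one character
    all_parts = match('part_*')          # * matches any number of characters
    all_csv = match('*.csv')

print(single_digit)  # ['part_1.csv', 'part_2.csv']
print(all_parts)     # ['part_1.csv', 'part_10.csv', 'part_2.csv']
print(all_csv)       # ['part_1.csv', 'part_10.csv', 'part_2.csv']
```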

In [None]:
import glob

In [None]:
# Write the pattern: pattern
pattern = 'uber*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)

# Load each matching file into its own DataFrame
data_list = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    data_list.append(df)


In [None]:
# Concatenate all loaded DataFrames row-wise: uber
uber = pd.concat(data_list)

# Print the head of uber
uber.head()

# Using regular expressions to clean strings

When working with data, it is sometimes necessary to write a regular expression to look for properly entered values. Phone numbers are a common field that needs to be checked for validity. Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern of xxx-xxx-xxxx.

## Instruction
- Compile a pattern that matches a phone number of the format xxx-xxx-xxxx.
 * Use \d{x} to match x digits. Here you'll need to use it three times: twice to match 3 digits, and once to match 4 digits.
 * Place the regular expression inside re.compile().
- Using the .match() method on prog, check whether the pattern matches the string '123-456-7890'.
- Using the same approach, now check whether the pattern matches the string '1123-456-7890'.

In [None]:
import re

In [None]:
pattern = re.compile(r'\$\d*\.\d{2}')
result = pattern.match('$17.89')

In [None]:
bool(result)

In [None]:
# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

In [None]:
# See if the pattern matches
result = prog.match('123-456-7890')

In [None]:
bool(result)

In [None]:
result = prog.match('1123-456-7890')
bool(result)
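One caveat worth sketching: `.match()` only anchors the pattern at the start of the string, so extra characters at the end go unnoticed. The snippet below, an addition to the exercise, compares it with `.fullmatch()`, which requires the whole string to fit.

```python
import re

prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# .match() only anchors at the start, so the trailing '1' is ignored
starts_ok = bool(prog.match('123-456-78901'))
print(starts_ok)   # True

# .fullmatch() requires the entire string to fit the pattern
whole_bad = bool(prog.fullmatch('123-456-78901'))
whole_ok = bool(prog.fullmatch('123-456-7890'))
print(whole_bad)   # False
print(whole_ok)    # True
```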

# Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: `'the recipe calls for 6 strawberries and 2 bananas'`

It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the `re.findall()` function. You pass in a pattern and a string to `re.findall()`, and it returns a list of the matches.

## Instruction

- Write a pattern that will find all the numbers in the following string: `the recipe calls for 10 strawberries and 1 banana`
 To do this:
 - Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
 - `\d` is the pattern required to find digits. This should be followed with a `+` so that the previous element is matched one or more times. This ensures that `10` is viewed as one number and not as `1` and `0`.
- Print the matches to confirm that your regular expression found the values `10` and `1`.

In [None]:
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)
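Keep in mind that `re.findall()` returns strings, not numbers. As a small sketch of the ratio idea mentioned earlier, the matches can be converted with `int()` before doing arithmetic; the variable names below are illustrative.

```python
import re

text = 'the recipe calls for 10 strawberries and 1 banana'

# findall returns a list of strings, so convert before doing arithmetic
strawberries, bananas = (int(m) for m in re.findall(r'\d+', text))

ratio = strawberries / bananas
print(ratio)  # 10.0
```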

## More Pattern matching
- Write patterns to match:
  - A telephone number of the format `xxx-xxx-xxxx`. You already did this in a previous exercise.
  - A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, `2` digits.
     - Use `\$` to match the dollar sign, `\d*` to match an arbitrary number of digits, `\.` to match the decimal point, and `\d{x}` to match `x` number of digits.
  - A capital letter, followed by an arbitrary number of alphanumeric characters.
     - Use `[A-Z]` to match any capital letter followed by `\w*` to match an arbitrary number of alphanumeric characters.

In [None]:
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
pattern1

In [None]:
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
pattern2

In [None]:
# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
pattern3

# Using functions to clean data
## Custom functions to clean data
The tips dataset will be used here. It has a 'sex' column that contains the values `'Male'` or `'Female'`. Your job is to write a function that will recode `'Male'` to `1`, `'Female'` to `0`, and return np.nan for all entries of `'sex'` that are neither `'Male'` nor `'Female'`.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

You can use the .apply() method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas Series. Functions can also be applied across Series. Here, you will apply your function over the `'sex'` column.

## Instructions
- Define a function named `recode_sex()` that has one parameter: sex_value.
  - If sex_value equals `'Male'`, return `1`.
  - Else, if `sex_value` equals `'Female'`, return `0`.
  - If sex_value does not equal `'Male'` or `'Female'`, return `np.nan`. 
- Apply your `recode_sex()` function over `tips.sex` using the `.apply()` method to create a new column: 'sex_recode'. Note that when passing in a function inside the `.apply()` method, you don't need to specify the parentheses after the function name.


In [None]:
tips = pd.read_csv('tips.csv')

In [None]:
# Define recode_sex()
def recode_sex(sex_value):

    # Return 1 if sex_value is 'Male'
    if sex_value == 'Male':
        return 1
    
    # Return 0 if sex_value is 'Female'    
    elif sex_value == 'Female':
        return 0
    
    # Return np.nan    
    else:
        return np.nan

In [None]:
# Apply the function to the sex column
tips['sex_recode'] = tips.sex.apply(recode_sex)

In [None]:
tips.head()
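For a simple lookup like this one, passing a dictionary to `Series.map()` is a common alternative to a custom function. The sketch below uses a made-up three-value Series in place of `tips.sex`; values that are missing from the dictionary come back as NaN.

```python
import pandas as pd

# Hypothetical stand-in for tips.sex
sex = pd.Series(['Male', 'Female', 'unknown'])

# Values missing from the dictionary become NaN
sex_recode = sex.map({'Male': 1, 'Female': 0})
print(sex_recode.tolist())  # [1.0, 0.0, nan]
```

A custom function remains the better fit when the recoding logic involves more than a one-to-one lookup.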

## Lambda functions
Instead of using the def syntax that you used in the previous exercise, lambda functions let you make simple, one-line functions.

For example, here's a function that squares a variable used in an `.apply()` method:

```python
def my_square(x):
    return x ** 2

df.apply(my_square)
```

The equivalent code using a lambda function is:

```python
df.apply(lambda x: x ** 2)
```

The lambda function takes one parameter: the variable `x`. The function itself just squares `x` and returns the result, which is whatever the one line of code evaluates to. In this way, lambda functions can make your code concise and Pythonic.

The tips dataset has been pre-loaded into a DataFrame called tips. Your job is to clean its 'total_dollar' column by removing the dollar sign. You'll do this using two different methods: With the `.replace()` method, and with regular expressions. The regular expression module re has been pre-imported.

## Instructions

- Use the `.replace()` method inside a lambda function to remove the dollar sign from the 'total_dollar' column of tips.
  - You need to specify two arguments to the `.replace()` method: The string to be replaced `('$')`, and the string to replace it by `('')`.
  - Apply the lambda function over the 'total_dollar' column of tips.
- Use a regular expression to remove the dollar sign from the `'total_dollar'` column of tips.
  - The pattern has been provided for you: It is the first argument of the `re.findall()` function.
  - Complete the rest of the lambda function and apply it over the `'total_dollar'` column of tips. Notice that because `re.findall()` returns a list, you have to index it (here with `[0]`) in order to access the actual value.

In [None]:
# Write the lambda function using replace
tips['total_dollar_replace'] = tips['total_dollar'].apply(lambda x: x.replace('$', ''))

In [None]:
# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips['total_dollar'].apply(lambda x: re.findall(r'\d+\.\d+', x)[0])


In [None]:
tips.head()
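Both approaches above leave the cleaned values as strings. As a sketch of a third option, not part of the original exercise, the vectorized `.str.replace()` method can strip the symbol and `.astype(float)` can convert the result in one chain. The Series below is a made-up stand-in for `tips['total_dollar']`.

```python
import pandas as pd

# Hypothetical stand-in for tips['total_dollar']
total_dollar = pd.Series(['$16.99', '$10.34', '$21.01'])

# regex=False treats '$' as a literal character rather than a regex anchor
cleaned = total_dollar.str.replace('$', '', regex=False).astype(float)
print(cleaned.tolist())  # [16.99, 10.34, 21.01]
```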

# Duplicate and missing data
## Dropping duplicate data
Duplicate data causes a variety of problems. From a performance standpoint, it uses up unnecessary memory and causes unneeded calculations to be performed when processing data. In addition, it can bias any analysis results.

A dataset consisting of the performance of songs on the Billboard charts has been pre-loaded into a DataFrame called billboard. Check out its columns in the IPython Shell. Your job in this exercise is to subset this DataFrame and then drop all duplicate rows.

In [None]:
tracks = pd.read_csv('tracks.csv')

In [None]:
# Print info of tracks (.info() prints directly and returns None)
tracks.info()

In [None]:
# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks_no_duplicates
tracks_no_duplicates.info()
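Before dropping anything, it can help to count how many rows are duplicates. The sketch below does this with `.duplicated()` on a made-up three-row frame; the column values are illustrative only.

```python
import pandas as pd

# Hypothetical track listing with one repeated row
tracks_demo = pd.DataFrame({
    'artist': ['A', 'A', 'B'],
    'track': ['x', 'x', 'y'],
})

# duplicated() flags rows that repeat an earlier row
n_dupes = tracks_demo.duplicated().sum()

deduped = tracks_demo.drop_duplicates()
print(n_dupes, len(deduped))  # 1 2
```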

## Filling missing data
Here, you'll return to the airquality dataset from Chapter 2. It has been pre-loaded into the DataFrame airquality, and it has missing values for you to practice filling in. Explore airquality in the IPython Shell to check out which columns have missing values.

It's rare to have a (real-world) dataset without any missing values, and it's important to deal with them because certain calculations cannot handle missing values while some calculations will, by default, skip over any missing values.

Also, understanding how much missing data you have, and thinking about where it comes from is crucial to making unbiased interpretations of data.
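A quick way to see how much data is missing is to count the NaN values per column. The sketch below uses a small made-up frame (not the airquality file) and then fills the gaps with the column mean, which by default is computed over the non-missing values only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing Ozone readings
demo = pd.DataFrame({'Ozone': [41.0, np.nan, 12.0, np.nan],
                     'Wind': [7.4, 8.0, 12.6, 11.5]})

# Missing values per column
missing = demo.isna().sum()
print(missing)

# mean() skips NaN by default, so this is the mean of 41.0 and 12.0
oz_mean = demo['Ozone'].mean()
filled = demo['Ozone'].fillna(oz_mean)
print(filled.tolist())  # [41.0, 26.5, 12.0, 26.5]
```

Mean imputation is only one option, and whether it is appropriate depends on why the values are missing.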

In [None]:
airquality = pd.read_csv('airquality.csv')

In [None]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality['Ozone'].mean()

In [None]:
airquality.info()

In [None]:
# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality['Ozone'].fillna(oz_mean)

In [None]:
airquality.info()

# Testing with asserts
We use the `.all()` method together with the `.notnull()` DataFrame method to check for missing values in a column. The `.all()` method returns True if all values are True. When used on a DataFrame, it returns a Series of Booleans - one for each column in the DataFrame. So if you are using it on a DataFrame, like in this exercise, you need to chain another `.all()` method so that you return only one True or False value. When using these within an assert statement, nothing will be returned if the assert statement is true: This is how you can confirm that the data you are checking are valid.

Note: You can use `pd.notnull(df)` as an alternative to `df.notnull()`.
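The chained `.all().all()` pattern can be sketched on two tiny, made-up frames: one complete and one with a gap. The names `clean` and `dirty` are illustrative.

```python
import numpy as np
import pandas as pd

clean = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
dirty = pd.DataFrame({'a': [1, np.nan], 'b': [3, 4]})

# notnull() -> DataFrame of booleans; first all() -> one boolean per column;
# second all() -> a single boolean
print(clean.notnull().all())
assert clean.notnull().all().all()

# The same check on data with a gap raises AssertionError
try:
    assert dirty.notnull().all().all()
except AssertionError:
    print('dirty contains missing values')
```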

In [None]:
# Load the ebola data: ebola1
ebola1 = pd.read_csv('ebola.csv')

# Assert that there are no missing values
assert pd.notnull(ebola1).all().all()

In [None]:
# Assert that all values are >= 0
assert (ebola1 >= 0).all().all()

In [None]:
ebola1.info()

In [None]:
ebola1.head()
