# [Cleaning Data in Python](https://www.datacamp.com/courses/cleaning-data-in-python)

## Diagnose data for cleaning

### [Example 2 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=2)

```python
# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv('dob_job_application_filings_subset.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

# Print the head and tail of df_subset
# (df_subset is not created here; it is assumed to be provided by the exercise environment)
print(df_subset.head())
print(df_subset.tail())
```

### [Example 3 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=3)

```python
# Print the info of df (info() prints its report directly and returns None)
df.info()

# Print the info of df_subset
df_subset.info()
```

## EDA - Exploratory Data Analysis

`df.continent.value_counts(dropna=False)`

- With `dropna=False`, the null values will also be counted 
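
As a small illustration, the sketch below compares the default counts with `dropna=False`. The `df_demo` frame and its values are made up for this example and are not course data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 'continent' column with two missing entries
df_demo = pd.DataFrame({'continent': ['Asia', 'Europe', np.nan, 'Asia', np.nan]})

# Default behavior: missing values are left out of the counts
counts_default = df_demo['continent'].value_counts()

# dropna=False: missing values get their own row in the result
counts_with_na = df_demo['continent'].value_counts(dropna=False)

print(counts_default)
print(counts_with_na)
```

If the two totals differ, the column contains missing values that need attention.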

### [Example 6 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=6)

```python
# Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))

# Print the value_counts for 'State'
print(df['State'].value_counts(dropna=False))

# Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))
```

## Bar plots and histograms

- Bar plots for discrete data counts
- Histograms for continuous data counts
- Use both to look at frequencies
- Box plots summarize a numeric variable within each group:

```python
df.boxplot(column='population', by='continent')
```

- Scatter plots for relationships between 2 numeric variables
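
A minimal sketch of the first two plot types, assuming a made-up `df_demo` frame (the column names and values are hypothetical). It selects the non-interactive `Agg` backend so it runs without a display; in a notebook you would normally skip that line and call `plt.show()`.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available, so draw off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one discrete column and one continuous column
df_demo = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Europe', 'Africa', 'Europe', 'Asia'],
    'population': [120.0, 95.5, 60.2, 30.1, 82.7, 140.3],
})

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Bar plot: one bar per discrete category, height = number of rows
df_demo['continent'].value_counts().plot(kind='bar', ax=ax_bar)

# Histogram: the continuous variable is grouped into 4 bins
df_demo['population'].plot(kind='hist', bins=4, ax=ax_hist)

fig.tight_layout()
```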

### [Example 8 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=8)

```python
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
plt.show()
```



### [Example 10 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=10)

```python
# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create and display the first scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

# Create and display the second scatter plot
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
```

### [Example 1 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=1)
![Tidy Data](tidy_data.png)
![Tidy Data 2](tidy_data2.png)
[Tidy Data Paper](TidyData.pdf)

```python
pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'], var_name='treatment', value_name='result')
```
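
To make the call above concrete, here is a self-contained sketch on a tiny wide table. The `df_wide` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per treatment
df_wide = pd.DataFrame({
    'name': ['Daniel', 'John', 'Jane'],
    'treatment a': [None, 12, 24],
    'treatment b': [42, 31, 27],
})

# Melt the treatment columns into (treatment, result) pairs, keeping 'name' fixed
df_long = pd.melt(
    frame=df_wide,
    id_vars='name',
    value_vars=['treatment a', 'treatment b'],
    var_name='treatment',
    value_name='result',
)

print(df_long)
```

Each of the 3 names now appears once per treatment, so the result has 6 rows and 3 columns.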

### [Example 3 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=3)

```python
# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())
```

### [Example 4 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=4)

```python
# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())
```

## Pivot Data

![Pivot Data](pivot.png)

```python
weather_tidy = weather.pivot(index='date', columns='element', values='value')
print(weather_tidy)
```

![Pivot Table](pivot_table.png)

```python
# Duplicate date/element pairs are aggregated with their mean
weather2_tidy = weather.pivot_table(index='date', columns='element', values='value', aggfunc='mean')
```
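
The difference between the two methods shows up when an index/column pair occurs more than once. The sketch below uses a made-up `weather_demo` frame (hypothetical values) with one duplicated pair.

```python
import pandas as pd

# Hypothetical long-format data with a duplicated (date, element) pair
weather_demo = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-01-30', '2010-02-02'],
    'element': ['tmax', 'tmin', 'tmin', 'tmax'],
    'value': [27.8, 14.5, 16.5, 27.3],
})

# pivot() cannot reshape when an index/column pair appears more than once
try:
    weather_demo.pivot(index='date', columns='element', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() resolves the duplicates by aggregating them (here: the mean)
weather_demo_tidy = weather_demo.pivot_table(
    index='date', columns='element', values='value', aggfunc='mean'
)

print(pivot_failed)
print(weather_demo_tidy)
```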

### [Example 6 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=6)

```python
# Print the head of airquality_melt
print(airquality_melt.head())

# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())
```

### [Example 7 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=7)

```python
# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)

# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
```

### [Example 8 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=8)

```python
# Pivot airquality_dup: airquality_pivot (duplicate rows are averaged)
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc='mean')

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())

# Print the head of airquality
print(airquality.head())
```

### [Example 10 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=10)

```python
# Melt tb: tb_melt
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())
```

### [Example 11 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=11)

```python
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

# Print the head of ebola_melt
print(ebola_melt.head())
```

## Pandas `concat()`

```python
pd.concat([weather_p1, weather_p2], ignore_index=True)
```
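
A short sketch of what `ignore_index=True` changes. The contents of the two frames are invented for this example.

```python
import pandas as pd

# Hypothetical pieces of one dataset, each with its own 0-based index
weather_p1 = pd.DataFrame({'day': ['d1', 'd2'], 'tmax': [27.8, 27.3]})
weather_p2 = pd.DataFrame({'day': ['d3', 'd4'], 'tmax': [24.1, 29.9]})

# Without ignore_index, the original row labels are kept and repeat
stacked_keep = pd.concat([weather_p1, weather_p2])

# With ignore_index=True, the result gets a fresh 0..n-1 index
stacked_reset = pd.concat([weather_p1, weather_p2], ignore_index=True)

print(list(stacked_keep.index))   # [0, 1, 0, 1]
print(list(stacked_reset.index))  # [0, 1, 2, 3]
```

Repeated labels make `.loc` lookups return several rows, which is usually the reason to reset the index.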

### [Example 2 combining-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=2)

```python
# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())
```

### [Example 3 combining-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=3)

```python
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)

# Print the head of ebola_tidy
print(ebola_tidy.head())
```

### [Example 5 combining-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=5)

![concatenating many](concat_many.png)

![globbing](globbing.png)

```python
import glob
csv_files = glob.glob('*.csv')
print(csv_files)
```

- For Loop:

```python
list_data = []

for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)
pd.concat(list_data)
```
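
The loop above can be tried end to end with the sketch below, which first writes a few throwaway CSV files to a temporary directory. The file names and contents are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three small hypothetical CSV files to a scratch directory
    for i in range(3):
        part = pd.DataFrame({'file_id': [i, i], 'value': [i * 10, i * 10 + 1]})
        part.to_csv(os.path.join(tmp_dir, f'part_{i}.csv'), index=False)

    # sorted() makes the load order predictable; glob itself guarantees no order
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))

    # Read every file into a list, then concatenate once at the end
    frames = [pd.read_csv(path) for path in csv_files]
    combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
```

Collecting the frames in a list and concatenating once avoids repeatedly copying a growing DataFrame inside the loop.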

```python
# Import necessary modules
import glob
import pandas as pd

# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)

# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

# Print the head of csv2
print(csv2.head())
```

### [Example 6 combining-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=6)

```python
# Create an empty list: frames
frames = []

# Iterate over csv_files
for csv in csv_files:

    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)

    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())
```

### [Example 8 combining-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=8)

- Merging Data
    - One-to-one
    - Many-to-one/one-to-many
    - Many-to-many
- One-to-one example (no duplicate values):

```python
pd.merge(left=state_populations, right=state_codes, left_on='state', right_on='name')
```
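
One way to check which kind of merge you actually have is the `validate` argument of `pd.merge`. The sketch below uses small made-up frames (all names and values are hypothetical) to show a one-to-one merge, a many-to-one merge, and the error raised when the expectation is wrong.

```python
import pandas as pd

# Hypothetical lookup tables: each state appears exactly once in both
populations_demo = pd.DataFrame({
    'state': ['California', 'Texas', 'Florida'],
    'population': [100, 200, 300],
})
codes_demo = pd.DataFrame({
    'name': ['California', 'Florida', 'Texas'],
    'code': ['CA', 'FL', 'TX'],
})

# One-to-one: keys are unique on both sides
one_to_one = pd.merge(left=populations_demo, right=codes_demo,
                      left_on='state', right_on='name', validate='one_to_one')

# Hypothetical table where a state repeats (several cities per state)
cities_demo = pd.DataFrame({
    'city': ['A', 'B', 'C'],
    'state': ['California', 'California', 'Texas'],
})

# Many-to-one: duplicate keys on the left, unique keys on the right
many_to_one = pd.merge(left=cities_demo, right=codes_demo,
                       left_on='state', right_on='name', validate='many_to_one')

# Claiming one-to-one for the same data raises a MergeError
try:
    pd.merge(left=cities_demo, right=codes_demo,
             left_on='state', right_on='name', validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

print(one_to_one)
print(many_to_one)
print(raised)
```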

```python
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print o2o
print(o2o)
```

### [Example 9 combining-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=9)

```python
# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print m2o
print(m2o)
```

### [Example 10 combining-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=10)

```python
# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))
```

## Data Types

- Changing dtypes

```python
df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')
df.dtypes
```

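##### Categorical data

- Converting categorical data to the `category` dtype:
    - can make the DataFrame smaller in memory
    - lets other Python libraries that understand the dtype use it for analysis

##### Numeric data stored as strings

- If we expect a numeric column and receive strings instead, this is normally a sign of messy data. `errors='coerce'` turns values that cannot be parsed into `NaN`:
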
```python
df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')
```
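
Both points can be checked on a small example. The `tips_demo` frame below is made up for illustration, and the exact byte counts depend on the pandas version, so the sketch only compares them.

```python
import pandas as pd

# Hypothetical data: a repetitive text column and a numeric column stored as text
tips_demo = pd.DataFrame({
    'sex': ['Female', 'Male'] * 500,
    'total_bill': ['16.99', '10.34', 'missing', '23.68'] * 250,
})

# Memory used by the text column before and after converting to 'category'
bytes_as_text = tips_demo['sex'].memory_usage(deep=True)
bytes_as_category = tips_demo['sex'].astype('category').memory_usage(deep=True)

# errors='coerce' replaces unparseable strings such as 'missing' with NaN
tips_demo['total_bill'] = pd.to_numeric(tips_demo['total_bill'], errors='coerce')

print(bytes_as_text, bytes_as_category)
print(tips_demo['total_bill'].isna().sum())
```

The saving comes from storing each distinct label once and keeping only small integer codes per row.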

### [Example 2 cleaning-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/cleaning-data-for-analysis?ex=2)

```python
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())
```

![String Manipulation](string_manip.png)

### String Manipulation

- Many built-in and external libraries
- `re` library for regular expressions
    - A formal way of specifying a pattern
    - Sequence of characters
- Pattern matching
    - Similar to globbing

### Regular Expression Examples:
![Regular Expression](regex.png)

### Using regular expressions
- Compile the pattern
- Use the compiled pattern to match values
- This lets us use the pattern over and over again
- Useful since we want to match values down a column of values

```python
import re
pattern = re.compile(r'\$\d*\.\d{2}')  # raw string keeps the backslashes intact
result = pattern.match('$17.89')
bool(result)
```
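
Reusing one compiled pattern down a column might look like the sketch below. The `costs` Series is hypothetical, and the `^...$` anchors are an addition here so that the whole string has to match.

```python
import re

import pandas as pd

# Compile once, reuse for every value in the column
pattern = re.compile(r'^\$\d*\.\d{2}$')

# Hypothetical column of monetary strings, some malformed
costs = pd.Series(['$17.89', '$1.5', '17.89', '$0.00'])

# Apply the same compiled pattern to each value
is_valid = costs.apply(lambda value: bool(pattern.match(value)))

print(is_valid.tolist())  # [True, False, False, True]
```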

See the [Regular Expression Library Documentation](https://docs.python.org/3/library/re.html).

### [Example 5 cleaning-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/cleaning-data-for-analysis?ex=5)

```python
# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))
```

### [Example 6 cleaning-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/cleaning-data-for-analysis?ex=6)

```python
# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)
```

### [Example 7 cleaning-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/cleaning-data-for-analysis?ex=7)

```python
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)
```

### Using functions to clean data
- First write the regular expression

```python
import re
from numpy import nan as NaN  # numpy.NaN was removed in NumPy 2.0; nan is the supported name
pattern = re.compile(r'^\$\d*\.\d{2}$')
```

- Next, write the function

```python
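def diff_money(row, pattern):
    """Return 'Initial Cost' minus 'Total Est. Fee', or NaN if either value is malformed."""
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']

    # Only compute the difference when both values look like dollar amounts
    if pattern.match(icost) and pattern.match(tef):
        # Strip the dollar sign before converting to float
        icost = float(icost.replace("$", ""))
        tef = float(tef.replace("$", ""))
        return icost - tef

    # At least one value did not match the pattern
    return NaN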

- Call the function:

```python
df_subset['diff'] = df_subset.apply(diff_money, axis=1, pattern=pattern)
print(df_subset.head())
```
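
The three steps above can be run together on a tiny made-up frame. In this sketch `df_demo` and its values are hypothetical, and `np.nan` stands in for the imported `NaN` name.

```python
import re

import numpy as np
import pandas as pd

# Dollar amounts such as '$986.00'; the whole string must match
pattern = re.compile(r'^\$\d*\.\d{2}$')


def diff_money(row, pattern):
    """Return 'Initial Cost' minus 'Total Est. Fee', or NaN if either is malformed."""
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']
    if pattern.match(icost) and pattern.match(tef):
        return float(icost.replace('$', '')) - float(tef.replace('$', ''))
    return np.nan


# Hypothetical rows: the last one has a malformed cost
df_demo = pd.DataFrame({
    'Initial Cost': ['$75000.00', '$0.00', 'unknown'],
    'Total Est. Fee': ['$986.00', '$1144.00', '$522.50'],
})

# axis=1 passes each row to the function; extra keywords are forwarded to it
df_demo['diff'] = df_demo.apply(diff_money, axis=1, pattern=pattern)

print(df_demo)
```

Note that this version assumes every cell is a string; a missing value stored as a float would need a check before calling `pattern.match`.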

### [Example 9 cleaning-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/cleaning-data-for-analysis?ex=9)

```python

```
