# [Cleaning Data in Python](https://www.datacamp.com/courses/cleaning-data-in-python)

## Diagnose data for cleaning

### [Example 2 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=2)

```python
# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv('dob_job_application_filings_subset.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

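# Print the head and tail of df_subset
# (df_subset is not defined in this snippet; it is assumed to be provided by the exercise environment)
print(df_subset.head())
print(df_subset.tail())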
```

### [Example 3 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=3)

```python
# Show the info of df (info() prints its summary itself and returns None,
# so wrapping it in print() would add a stray "None" line)
df.info()

# Show the info of df_subset
df_subset.info()
```

## EDA - Exploratory Data Analysis

`df.continent.value_counts(dropna=False)`

- With `dropna=False`, missing values (`NaN`) are counted too; by default they are excluded
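A minimal sketch of the difference, using a made-up `df_demo` frame (the data is hypothetical; only the `continent` column name comes from the line above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
df_demo = pd.DataFrame(
    {'continent': ['Asia', 'Europe', np.nan, 'Asia', np.nan, 'Asia']}
)

# Default behaviour: NaN rows are left out of the counts
counts_default = df_demo['continent'].value_counts()

# dropna=False: NaN gets its own row in the result
counts_with_nan = df_demo['continent'].value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

Comparing the two totals is a quick way to see how many values are missing in a column.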

### [Example 6 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=6)

```python
# Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))

# Print the value_counts for 'State'
print(df['State'].value_counts(dropna=False))

# Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))
```

## Bar plots and histograms

- Bar plots show counts for discrete data
- Histograms show counts for continuous data
- Both let you look at frequencies
- Box plots summarize the distribution of a numeric variable per group and help spot outliers:

```python
df.boxplot(column='population', by='continent')
```

- Scatter plots show relationships between 2 numeric variables

### [Example 8 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=8)

```python
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
plt.show()
```



### [Example 10 exploring-your-data](https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data?ex=10)

```python
# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create and display the first scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

# Create and display the second scatter plot
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
```

### [Example 1 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=1)

![Tidy Data](tidy_data.png)
![Tidy Data 2](tidy_data2.png)

[Tidy Data Paper](TidyData.pdf)

```python
pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'], var_name='treatment', value_name='result')
```
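To see what the `pd.melt()` call above does, here is a small self-contained sketch. The frame `df_wide` and its values are invented for illustration; the column names match the call above.

```python
import pandas as pd

# Hypothetical wide-format data: one column per treatment
df_wide = pd.DataFrame({
    'name': ['Daniel', 'John', 'Jane'],
    'treatment a': [18, 12, 24],
    'treatment b': [42, 31, 27],
})

# Turn the treatment columns into rows: one row per (name, treatment) pair
df_long = pd.melt(
    frame=df_wide,
    id_vars='name',
    value_vars=['treatment a', 'treatment b'],
    var_name='treatment',
    value_name='result',
)

print(df_long)
```

The 3 x 3 wide frame becomes a 6 x 3 long frame: `id_vars` columns are repeated, and each former column header ends up in the `treatment` column.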

### [Example 3 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=3)

```python
# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())
```

### [Example 4 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=4)

```python
# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())
```

## Pivot Data

![Pivot Data](pivot.png)

```python
weather_tidy = weather.pivot(index='date', columns='element', values='value')
print(weather_tidy)
```

![Pivot Table](pivot_table.png)

```python
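# pivot_table aggregates duplicate (date, element) pairs; here with the mean.
# The string 'mean' avoids the warning newer pandas versions emit for np.mean.
weather2_tidy = weather.pivot_table(index='date', columns='element', values='value', aggfunc='mean')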
```
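The reason for having both methods can be sketched as follows: `pivot()` refuses to reshape when an index/column pair occurs more than once, while `pivot_table()` aggregates the duplicates. The `weather_demo` frame below is hypothetical data made up for this illustration.

```python
import pandas as pd

# Hypothetical long-format data; ('2010-01-30', 'tmin') appears twice
weather_demo = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-01-30', '2010-02-02', '2010-02-02'],
    'element': ['tmax', 'tmin', 'tmin', 'tmax', 'tmin'],
    'value': [27.8, 14.5, 15.5, 27.3, 14.4],
})

# pivot() cannot decide which duplicate to keep and raises ValueError
try:
    weather_demo.pivot(index='date', columns='element', values='value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table() resolves the duplicates by aggregating them
weather_demo_tidy = weather_demo.pivot_table(
    index='date', columns='element', values='value', aggfunc='mean'
)
print(weather_demo_tidy)
```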

### [Example 6 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=6)

```python
# Print the head of airquality_melt
print(airquality_melt.head())

# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())
```

### [Example 7 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=7)

```python
# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)

# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
```
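A self-contained sketch of the same idea on made-up data (`long_demo` and its values are hypothetical): pivoting on two index columns produces a `MultiIndex`, and `reset_index()` turns those levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical melted data in the same shape as airquality_melt
long_demo = pd.DataFrame({
    'Month': [5, 5, 5, 5],
    'Day': [1, 1, 2, 2],
    'measurement': ['Ozone', 'Temp', 'Ozone', 'Temp'],
    'reading': [41.0, 67.0, 36.0, 72.0],
})

# Pivoting on two columns gives a hierarchical (Month, Day) index
pivot_demo = long_demo.pivot_table(
    index=['Month', 'Day'], columns='measurement', values='reading'
)
print(pivot_demo.index)

# reset_index() moves Month and Day back into regular columns
flat_demo = pivot_demo.reset_index()
print(flat_demo.index)
print(flat_demo.columns.tolist())
```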

### [Example 8 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=8)

```python
# Pivot airquality_dup: airquality_pivot
# (aggfunc='mean' averages duplicate rows and needs no numpy import)
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc='mean')

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())

# Print the head of airquality
print(airquality.head())
```

### [Example 10 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=10)

```python
# Melt tb: tb_melt
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())
```
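The `.str` accessor applies Python string slicing to every element of a column. A minimal sketch, assuming codes where the first character is the gender and the rest is the age group (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical codes: first character = gender, remainder = age group
codes = pd.Series(['m014', 'f1524', 'm65'])

# .str[0] takes the first character of each string
gender = codes.str[0]

# .str[1:] takes everything after the first character
age_group = codes.str[1:]

print(gender.tolist())
print(age_group.tolist())
```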

### [Example 11 tidying-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/tidying-data-for-analysis?ex=11)

```python
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

# Print the head of ebola_melt
print(ebola_melt.head())
```
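The split-then-get pattern can be tried on a small hypothetical Series (the labels below are invented, following the `type_country` naming used above). The sketch also shows `expand=True`, an alternative that returns the pieces as DataFrame columns in one step.

```python
import pandas as pd

# Hypothetical labels in the form <type>_<country>
labels = pd.Series(['Cases_Guinea', 'Deaths_Liberia'])

# Each element becomes a list such as ['Cases', 'Guinea']
parts = labels.str.split('_')

# .str.get(i) picks the i-th item of each list
kind = parts.str.get(0)
country = parts.str.get(1)

# Alternative: expand=True returns one column per piece
split_df = labels.str.split('_', expand=True)

print(kind.tolist(), country.tolist())
print(split_df)
```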

## Pandas `concat()`

```python
pd.concat([weather_p1, weather_p2], ignore_index=True)
```
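A short sketch of what `ignore_index=True` changes. The two frames below are hypothetical stand-ins for `weather_p1` and `weather_p2`:

```python
import pandas as pd

# Hypothetical pieces of one dataset, each with its own 0-based index
weather_p1_demo = pd.DataFrame({'date': ['2010-01-30', '2010-02-02'], 'tmax': [27.8, 27.3]})
weather_p2_demo = pd.DataFrame({'date': ['2010-02-03', '2010-02-11'], 'tmax': [24.1, 29.7]})

# Without ignore_index the original row labels are kept, so they repeat
stacked_keep = pd.concat([weather_p1_demo, weather_p2_demo])
print(stacked_keep.index.tolist())

# With ignore_index=True the result gets a fresh 0..n-1 index
stacked_reset = pd.concat([weather_p1_demo, weather_p2_demo], ignore_index=True)
print(stacked_reset.index.tolist())
```

Repeated labels make `.loc[0]` return several rows, which is usually not what you want after stacking files.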

### [Example 2 combining-data-for-analysis](https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=2)

```python

```
