# 3.1 Cleaning dirty data

Before diving into data analysis, it is crucial to ensure that your dataset is clean and well-prepared. Data cleaning involves identifying and fixing issues such as missing values, incorrect data types, and typos, so that the data is accurate and usable. In this lesson, we will explore various data cleaning techniques using pandas, Python's data analysis library, which will help us transform messy data into a structured and reliable format.

## Why data cleaning matters

Data cleaning is a critical step in the data analysis process for several reasons:

- Accurate analysis: Clean data ensures the reliability of your analysis, leading to accurate results and insights.
- Improved data quality: High-quality data leads to better decision-making and more robust conclusions.
- Enhanced efficiency: Clean data reduces the risk of errors and streamlines the analysis process.

A big advantage of cleaning data in Python is that you can rerun the same code whenever you need to. This is especially useful when a dataset is updated regularly!

## Data integrity checks

First, as always, start with checking the integrity of your dataset. This means checking:

- the number of columns and rows (shape)
- the numerical columns - looking for outliers or other weird issues (describe)
- the number of blanks (isnull)
- data types (dtypes)
- misspellings and other typos

And then fixing any issues you find!

### Import data

For this lesson we'll work with a database of movies and shows on Netflix.

In [None]:
import pandas as pd
import numpy as np

# Import data from a CSV file
df = pd.read_csv('https://raw.githubusercontent.com/dnmalan/advanced-data-journalism-23/main/data/netflix_titles.csv')

# Display the first few rows of the DataFrame
df.head()

Right off the bat we can see that there are a few issues with this dataset:

- missing values (NaN = Not a Number, which is how pandas denotes blanks or nulls)
- columns with lots of values separated by commas (listed_in)
- columns with words and numbers (date_added)
- the duration has mixed units (minutes/seasons)
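
As a rough sketch of how the last three issues could be handled, the cell below works on a small made-up DataFrame that only mimics the layout of these columns. The rows, the new column names (`duration_value`, `duration_unit`, `genres`), and the assumed date format are illustrative and not taken from the Netflix file.

```python
import pandas as pd

# Toy rows that mimic the layout of the Netflix columns (values are made up)
sample = pd.DataFrame({
    "date_added": ["September 25, 2021", "January 1, 2020", None],
    "duration": ["90 min", "2 Seasons", "1 Season"],
    "listed_in": ["Dramas, Comedies", "Kids' TV", "Docuseries, Reality TV"],
})

# Words-and-numbers dates -> real datetimes (values that fail to parse become NaT)
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Mixed units: split "90 min" into a number and a unit
parts = sample["duration"].str.extract(r"(?P<duration_value>\d+)\s*(?P<duration_unit>\w+)")
sample["duration_value"] = pd.to_numeric(parts["duration_value"])
# Lowercase the unit and drop a trailing "s" so "Seasons" and "Season" match
sample["duration_unit"] = parts["duration_unit"].str.lower().str.replace(r"s$", "", regex=True)

# Comma-separated values: turn each entry into a list
sample["genres"] = sample["listed_in"].str.split(", ")

sample
```

Keeping the number and the unit in separate columns means you can later filter on the unit and do arithmetic on the number.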

### Check record count and check for blanks

Let's run a few more checks to get to know our data better.

In [None]:
#check how many rows and columns

df.shape

In [None]:
# which cells are blank? (True = missing)

df.isnull()

In [None]:
# count how many nulls by adding sum()

df.isnull().sum()

In [None]:
#what percentage of each column is null?

df.isnull().mean()

We can see that some columns have very few NaNs, so we won't worry too much about them. But some of the columns, like director, cast, and country, have a higher percentage. These might be a problem.

You can deal with missing values in a few ways. If a column has a large share of blanks, you likely can't use it unless you fill in the missing data. If it has only a few blanks, you can usually ignore them. In some cases you might use the column but put a disclaimer on your analysis.
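
The options above can be sketched on a small made-up DataFrame. The 50% cutoff and the "Unknown" label are arbitrary choices for illustration, not rules.

```python
import pandas as pd
import numpy as np

# Toy data (made up) with gaps in two columns
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [np.nan, "Jane Doe", np.nan, np.nan],
    "rating": ["PG", np.nan, "TV-MA", "R"],
})

# Share of missing values per column
null_share = toy.isnull().mean()

# Option 1: keep only columns that are at most half empty (cutoff is an assumption)
mostly_filled = toy.loc[:, null_share <= 0.5]

# Option 2: drop only the rows that are missing a value you need
rated_only = toy.dropna(subset=["rating"])

# Option 3: fill the gaps in one column with an explicit label
labeled = toy.fillna({"director": "Unknown"})
```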

For now, let's get rid of the NaNs by replacing them with true blanks ("") using the **.fillna()** method.

In [None]:
# names of the columns
columns = ['director', 'cast', 'country', 'rating', 'date_added']

# looping through the columns to fill the entries with NaN values with ""
for column in columns:
    df[column] = df[column].fillna("")

### Check data types

Now let's check the data types in each column. This affects how we will write our code later, so it's very important. You might need to change data types depending on what you want to do. For example, numbers are sometimes stored as text (or objects), and mathematical functions won't work unless you change the column type to numbers.

In [None]:
# check column data types

df.dtypes

Here we see most of the columns are "object" (text, also known as strings), with the exception of release_year, which is an "int64" (integer). This should be fine for our analysis. If you need to change data types, you can use functions such as **pd.to_numeric()**.
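
As a minimal illustration of a type conversion, assume a year column that was read in as text and contains one unusable entry. The toy values and the `release_year_num` column name are made up for this sketch.

```python
import pandas as pd

# Toy column (made up): years stored as text, with one bad entry
toy = pd.DataFrame({"release_year": ["2019", "2020", "unknown"]})

# errors="coerce" turns anything that is not a number into NaN instead of raising an error
toy["release_year_num"] = pd.to_numeric(toy["release_year"], errors="coerce")

toy.dtypes
```

Because of the NaN, the new column comes out as a float rather than an integer.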

### Getting rid of trailing or leading spaces

Text data sometimes has spaces in front of or after the text. You might not be able to see it, but to the computer, it is there, and it means your data is dirty. Values of "hello" and "hello " are two different values, even though we want them to be the same.

With text columns, it's a good idea to get rid of trailing and leading spaces before doing any further cleaning. It can save you some trouble!
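
One way to do this is the pandas string method `.str.strip()`. The sketch below uses a made-up column so the effect is easy to see.

```python
import pandas as pd

# Toy column (made up): "PG" appears three times, twice with stray spaces
toy = pd.DataFrame({"rating": ["PG", " PG", "PG ", "TV-MA"]})

# Count the distinct values before cleaning
before = toy["rating"].nunique()

# Remove leading and trailing whitespace from every value
toy["rating"] = toy["rating"].str.strip()

# Count again after cleaning
after = toy["rating"].nunique()

print(before, after)
```

On the Netflix data, the same call could be applied to any text column, for example `df['rating'] = df['rating'].str.strip()`.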



### Checking for typos, misspellings, and general dirty data

You should check text columns for misspellings and typos that could come into play when you're grouping. You can do this by running a **group by** as we did in Lesson 2, or you can also use another function called **unique**, which returns every unique value in a column.

In this dataset, we know that columns like director and cast will have many different unique values, so it doesn't make sense to check them. But let's check the rating to see what values exist in that column.

In [None]:
#check unique values in rating column

df['rating'].unique()

Most of these look good, but there are some values that appear to be durations (84 mins, 66 mins). We won't fix them here today, but you can move these to the duration column where they belong.
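
If you do want to move them, one possible approach is sketched below on a made-up DataFrame. It assumes the misplaced values look like a number followed by "min" or "mins", and that the duration is blank in those rows. Check both assumptions against the real data first.

```python
import pandas as pd
import numpy as np

# Toy rows (made up): two films have their running time in the rating column
toy = pd.DataFrame({
    "title": ["Show A", "Film B", "Film C"],
    "rating": ["TV-MA", "84 min", "66 min"],
    "duration": ["2 Seasons", np.nan, np.nan],
})

# True for ratings that consist only of "<number> min" or "<number> mins"
misplaced = toy["rating"].str.fullmatch(r"\d+ mins?", na=False)

# Copy the value into duration, then blank out the rating
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "rating"] = ""

toy
```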

There are also two values that mean the same thing (NR or "not rated" and UR or "unrated"). We can easily fix these by changing all the URs to NRs (the correct designation).

We'll look at two different ways to do this, since cleaning up data this way is very common!



In [None]:
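# Method 1: Using a for loop.
# This loops through the rows, and each time it sees a UR, it replaces it with an NR.
# Assigning through df.loc (rather than df['rating'].iloc[i]) writes to the DataFrame
# itself and avoids pandas' chained-assignment warning.

for i in df.index:
    if df.loc[i, 'rating'] == "UR":
        df.loc[i, 'rating'] = "NR"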

In [None]:
# Method 2: Using np.where from the NumPy library
# This checks every row at once: where the rating is 'UR' it writes 'NR', otherwise it keeps the existing value.

df['rating'] = np.where(df['rating'] == 'UR', 'NR', df['rating'])


In [None]:
# check the results

df['rating'].unique()

That's as far as we'll go with data cleaning today. Check out [this blog](https://www.einblick.ai/blog/data-cleaning-with-python/) for some other great functions and tips.

## Automation through APIs

APIs, or Application Programming Interfaces, are a set of rules and protocols that allow different software applications to communicate with each other. They define the methods and data formats that applications can use to request and exchange information. APIs are commonly used to access web services, retrieve data from external sources, or perform specific actions programmatically. In this lesson, we will explore what APIs are and how to use Python to connect to a sample API.

### Connecting to a Sample API: OpenWeatherMap

[OpenWeatherMap](https://openweathermap.org/) is a popular weather data provider that offers an API to access weather information for various locations around the world. We will use their free API to retrieve weather data for a specific city.

In [None]:
import requests

# Replace with your own API key (avoid sharing real keys in notebooks)
api_key = 'YOUR_API_KEY'

# Define the API endpoint URL
api_url = f'https://api.openweathermap.org/data/2.5/onecall?lat=39.90&lon=116.40&exclude=hourly,daily&units=metric&appid={api_key}'

# Send a GET request to the API
response = requests.get(api_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    
    # Access the weather information in the 'current' object
    current_weather = data['current']
    
    # Extract relevant weather information
    temperature = current_weather['temp']
    humidity = current_weather['humidity']
    weather_description = current_weather['weather'][0]['description']
    
    # Display the weather information
    print(f'Temperature: {temperature}°C')
    print(f'Humidity: {humidity}%')
    print(f'Weather Description: {weather_description}')
else:
    print(f'Failed to retrieve weather data (status code {response.status_code}).')

### Examples of other OpenWeatherMap API URLs

Get data by latitude/longitude:
https://api.openweathermap.org/data/3.0/onecall?lat={lat}&lon={lon}&appid={APIkey}

Request historical data:
https://api.openweathermap.org/data/3.0/onecall/timemachine?lat={lat}&lon={lon}&dt={time}&appid={APIkey}
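
The placeholders in curly braces are query parameters that you fill in yourself. As a small sketch, the helper below (`build_onecall_url` is a made-up name, and `YOUR_API_KEY` is a placeholder) builds the first URL with Python's standard library. It does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openweathermap.org/data/3.0/onecall"

def build_onecall_url(lat, lon, api_key, **extra):
    """Build a One Call URL for the given coordinates (hypothetical helper)."""
    # Extra keyword arguments (such as units="metric") become additional query parameters
    params = {"lat": lat, "lon": lon, **extra, "appid": api_key}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_onecall_url(39.90, 116.40, "YOUR_API_KEY", units="metric")
print(url)
```

Passing a `params` dictionary to `requests.get()` achieves the same thing and also handles the encoding for you.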



## Querying World Bank data API for GDP data

For details on how to structure calls to this API, see the World Bank's documentation: https://datahelpdesk.worldbank.org/knowledgebase/articles/898581-api-basic-call-structures

In [None]:
import requests

# Define the base URL for the World Bank Open Data API
base_url = "https://api.worldbank.org/v2/country/CN/indicator/NY.GDP.MKTP.CD"

# Define the parameters for the query
params = {
    "format": "json",
    "per_page": 10000,  # Set a high number to retrieve all available data points
}

# Send a GET request to the World Bank API
response = requests.get(base_url, params=params)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    
    # Extract and print the GDP data for China over time
    for item in data[1]:
        year = item['date']
        gdp = item['value']
        print(f"Year: {year}, GDP: {gdp}")
else:
    print('Failed to retrieve GDP data for China.')
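
The JSON that comes back is a two-element list: paging metadata first, then the list of records, which is why the code above reads `data[1]`. The sketch below uses a hand-written stand-in for `response.json()` to show one way of tidying the records. The numbers are placeholders, not real GDP figures, and the metadata shown is reduced to a few fields.

```python
import pandas as pd

# Hand-written stand-in for response.json(); values are placeholders, not real data
payload = [
    {"page": 1, "pages": 1, "per_page": 10000, "total": 3},
    [
        {"date": "2002", "value": None},
        {"date": "2001", "value": 200.0},
        {"date": "2000", "value": 100.0},
    ],
]

# First element is paging metadata, second is the list of records
metadata, records = payload

gdp = (
    pd.DataFrame(records)[["date", "value"]]
    .dropna(subset=["value"])                        # years with no value yet
    .assign(date=lambda d: d["date"].astype(int))    # "2001" -> 2001
    .sort_values("date")                             # oldest year first
    .reset_index(drop=True)
)

gdp
```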

In [None]:
import requests
import pandas as pd
import matplotlib.pyplot as plt

# Define the base URL for the World Bank Open Data API
base_url = "https://api.worldbank.org/v2/country/CN/indicator/NY.GDP.MKTP.CD"

# Define the parameters for the query
params = {
    "format": "json",
    "per_page": 10000,  # Set a high number to retrieve all available data points
}

# Send a GET request to the World Bank API
response = requests.get(base_url, params=params)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()[1]
    
    # Extract and format the data into a DataFrame
    # (named gdp_df so it does not overwrite the Netflix DataFrame, df)
    gdp_df = pd.DataFrame(data)
    gdp_df['date'] = pd.to_datetime(gdp_df['date'], format='%Y')  # Convert the year strings to datetime
    
    # Create a line chart
    plt.figure(figsize=(12, 6))
    plt.plot(gdp_df['date'], gdp_df['value'], marker='o', linestyle='-')
    plt.xlabel('Year')
    plt.ylabel('GDP (Current US$)')
    plt.title('GDP of China Over Time')
    plt.grid(True)
    plt.show()
else:
    print('Failed to retrieve GDP data for China.')