# C. Data Analysis and Visualization for Climate Change (60 points)
In this part, you will work with the dataset `GlobalLandTemperaturesByState.csv`, which contains historical climate data for states around the world from 1744 to 2013. Each record gives the average temperature for a state on a given date.


### **Question 1: Data Import (6 points)**
Use Pandas to import the climate change dataset into a DataFrame called `df_state`. Then find all the unique country names in the 'Country' column and print them. (There are seven unique country names in total.)
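As a rough illustration of the two steps described above, the sketch below reads a tiny in-memory CSV instead of the real file. The names `toy_csv`, `toy_df`, and `toy_countries`, the `dt` and `State` columns, and the sample rows are all made up for this example; only the 'Country' and "AverageTemperature" column names come from this assignment.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the real dataset file.
toy_csv = io.StringIO(
    "dt,AverageTemperature,State,Country\n"
    "1900-01-01,10.5,Acre,Brazil\n"
    "1900-01-01,-3.2,Alaska,United States\n"
    "1900-02-01,11.0,Acre,Brazil\n"
)

# read_csv accepts a file path or a file-like object.
toy_df = pd.read_csv(toy_csv)

# unique() returns each distinct value once, in order of first appearance.
toy_countries = toy_df["Country"].unique()
print(toy_countries)
```

With the real file, the same `pd.read_csv` call would presumably take the file path as its argument.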

In [None]:
import pandas as pd

In [None]:
# Import the climate change dataset GlobalLandTemperaturesByState.csv into a DataFrame called df_state.

# TODO_1.1


In [None]:
# Use Pandas to find all the unique country names in the 'Country' column and print them.
# Your output should print seven unique country names.

# TODO_1.2


### **Question 2: Data Cleaning (12 points, 12 points)**
The first step in examining any dataset involves the preparation and refinement of the data. Various forms of irregularities can occur during the data collection or curation process, and it is essential to rectify these issues before conducting any analysis.




**i.** Implement the function ***cleanse_country_data*** that does the following:
- Some country names include an additional abbreviation, such as "United States (US)". The function should simplify these names by discarding the abbreviation. More generally, any country name in the format "name1 (name2)" should be replaced with just "name1".
- The list `countries_to_remove` is provided because the data for these countries is inaccurate or incomplete. The function should remove all rows belonging to these countries.
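One possible way to approach the two requirements, shown on made-up data rather than `df_state`, is a regular-expression replacement followed by a boolean filter. The frame `toy`, the list `toy_remove`, and the country "Atlantis" are hypothetical, and the pattern assumes the parenthesised abbreviation always comes at the end of the name.

```python
import pandas as pd

# Hypothetical sample data; "Atlantis" plays the role of a country to remove.
toy = pd.DataFrame(
    {
        "Country": ["United States (US)", "Canada", "Atlantis", "India (IN)"],
        "AverageTemperature": [1.0, 2.0, 3.0, 4.0],
    }
)
toy_remove = ["Atlantis"]

# Strip a trailing " (name2)" so that "name1 (name2)" becomes "name1".
toy["Country"] = toy["Country"].str.replace(r"\s*\(.*\)$", "", regex=True)

# Keep only rows whose country is not in the removal list.
toy_clean = toy[~toy["Country"].isin(toy_remove)]
print(toy_clean)
```

Simplifying the names before filtering means a listed country would still be matched if it also carried an abbreviation; whether that ordering matters for the real data is not stated here.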

In [None]:
countries_to_remove = ['Brazil', 'Russia']
def cleanse_country_data(df):
    """
    Remove countries in the countries_to_remove list from the input dataframe
    and simplify the country names that include an additional abbreviation.

    args:
        df (pd.DataFrame) : the input dataframe to preprocess

    return:
        pd.DataFrame : the preprocessed dataframe
    """

    # TODO_2.1

    return df

In [None]:
# This is the assistant's program for review, please do not delete.
def test_preprocess_countries():
    df_country_cleaned = cleanse_country_data(df_state.copy())
    assert df_country_cleaned.columns.equals(df_state.columns)
    assert df_country_cleaned.dtypes.equals(df_state.dtypes)
    assert len(df_country_cleaned) == 29699
    
    unique_countries = df_country_cleaned["Country"].unique()
    assert len(unique_countries) == 5
    assert 'United States' in unique_countries
    assert 'Brazil' not in unique_countries
    print("All tests passed!")

test_preprocess_countries()

**ii.** Missing data can cause problems during analysis, and the simplest way to deal with it is to delete rows that have any missing values. Create the function ***drop_missing_values*** to remove rows in our dataset that have a missing value in any column.
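For reference, a minimal sketch of row-wise deletion on made-up data is given here. The frame `toy` and its values are hypothetical; `DataFrame.dropna` with its default arguments drops every row that contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row lacks a temperature, the third lacks a state.
toy = pd.DataFrame(
    {
        "AverageTemperature": [1.5, np.nan, 3.0],
        "State": ["A", "B", None],
    }
)

# Default dropna() removes any row with a missing value in any column.
toy_complete = toy.dropna()
print(toy_complete)
```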

In [None]:
def drop_missing_values(df):
    """
    Drop rows with at least one missing value from an input dataframe.

    args:
        df (pd.DataFrame) : an input dataframe

    returns:
        pd.DataFrame : a subset of df where rows with missing values in any column are removed.
    """

    # TODO_2.2

    return df

In [None]:
# This is the assistant's program for review, please do not delete.
df_country_filtered = drop_missing_values(df_state.copy())
assert df_country_filtered.columns.equals(df_state.columns)
assert df_country_filtered.dtypes.equals(df_state.dtypes)
assert len(df_country_filtered) == 51831
print("All tests passed!")

### **Question 3: Data Analysis (12 points)**

We can get an overview of our dataset by examining summary statistics. Use Pandas to display key statistics such as the minimum, maximum, mean, and standard deviation of the "AverageTemperature" column in `df_state`.
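As an illustration on made-up numbers, the four statistics can be collected in one call with `Series.agg`; `Series.describe` is another option that reports them alongside quartiles and the count. The series `toy_temps` below is hypothetical. Pandas computes the sample standard deviation (dividing by n - 1) by default and skips missing values in these reductions.

```python
import pandas as pd

# Hypothetical temperature values.
toy_temps = pd.Series([10.0, 12.0, 14.0, 16.0], name="AverageTemperature")

# Compute several summary statistics at once; the result is indexed by name.
toy_stats = toy_temps.agg(["min", "max", "mean", "std"])
print(toy_stats)
```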

In [None]:
# Show the key statistics such as the minimum value, maximum value, average (mean),
# and standard deviation of the "AverageTemperature" column in df_state.

# TODO_3


### **Question 4: Outlier Detection (12 points, 6 points)**

We can identify outliers using the Interquartile Range (IQR) rule: a data point is considered an outlier if it is at least 1.5 interquartile ranges below the first quartile (Q1), or at least 1.5 interquartile ranges above the third quartile (Q3), i.e.,

$$x \le Q_1 - 1.5 \times \mathrm{IQR} \quad \text{or} \quad x \ge Q_3 + 1.5 \times \mathrm{IQR},$$

where $x$ is the data point and $\mathrm{IQR} = Q_3 - Q_1$.

For an introduction to the IQR, see https://en.wikipedia.org/wiki/Interquartile_range


Create a function named ***remove_outliers***. This function will be responsible for removing rows from a DataFrame where the values in a specified column are identified as outliers based on the IQR rule.

After creating the function, apply it to the "AverageTemperature" column of our DataFrame `df_state` and store the result in a new DataFrame called `df_removed`. Next, compare the minimum, maximum, mean, and standard deviation of `df_removed` with those of `df_state`, to which ***remove_outliers*** was not applied.
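To make the IQR rule concrete, here is a small worked sketch on made-up numbers. The frame `toy`, its `value` column, and the variable names are hypothetical, and the quartiles use pandas' default linear interpolation, which may differ slightly from other quartile conventions.

```python
import pandas as pd

# Hypothetical data with one obviously extreme value.
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})

# Quartiles and interquartile range of the column.
q1 = toy["value"].quantile(0.25)
q3 = toy["value"].quantile(0.75)
iqr = q3 - q1

# Fences from the rule: values at or beyond them count as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep rows strictly inside the fences.
toy_kept = toy[(toy["value"] > lower) & (toy["value"] < upper)]
print(toy_kept)
```

Here Q1 is 2.25 and Q3 is 4.75, so the fences are -1.5 and 8.5 and only the value 100.0 is removed. Note that a missing value fails both comparisons, so a mask written this way also drops rows with NaN in the column; whether that is desired depends on the intended behaviour.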

In [None]:
def remove_outliers(df, col):
    """
    Remove any row whose value in a given column is considered an outlier according to the IQR rule.

    args:
        df (pd.DataFrame) : an input dataframe where outlier rows should be removed
        col (str) : the column name to check for outliers

    return:
        pd.DataFrame : a subset of the input dataframe after outlier rows are removed
    """

    # TODO_4

    return df

In [None]:
# This is the assistant's program for review, please do not delete.
def test_remove_outliers():
    df_country_new = remove_outliers(df_state.copy(), "AverageTemperature")
    assert df_country_new.columns.equals(df_state.columns)
    assert df_country_new.dtypes.equals(df_state.dtypes)
    assert len(df_country_new) == 53786
    assert abs(df_country_new["AverageTemperature"].min() + 42.97) < 0.01
    assert abs(df_country_new["AverageTemperature"].max() - 32.21) < 0.01
    assert abs(df_country_new["AverageTemperature"].mean() + 3.36) < 0.01
    print("All tests passed!")

test_remove_outliers()