# Challenge - Global warming data analysis

![global warming](https://th.bing.com/th/id/R.59a9ab47b7b9a1e50fb75b124c9b3c9f?rik=qBHpxbnGeuMNUA&pid=ImgRaw&r=0)

### Description

It's now time to look at our first dataset, including visualisation and data cleaning. 

In this case study, we will analyze global land temperature data by country, with the goal of finding any underlying relationships between temperature change and geographical location.

In addition, we should analyze the dataset as a whole: extracting statistical parameters, preprocessing the data and doing a bit of visualisation.

### Data

For this task, we will be using the [Climate Change: Earth Surface Temperature dataset](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data?select=GlobalLandTemperaturesByCountry.csv), which combines 1.6 billion temperature reports from 16 pre-existing archives (starting from the 1750s).


### Tasks

1. Preprocess and statistically describe the data
2. Find and visualize the 20 countries with the highest mean temperature
3. Which countries had the largest change in temperature?
4. What is the overall tendency?

To get the dataset, we need a Kaggle account and a kaggle.json file. First, log in to Kaggle (you can log in with your Google account): \
https://www.kaggle.com/

Then, find the 'API' subheading in your settings \
https://www.kaggle.com/settings \
and click 'Create New Token', which will automatically download a kaggle.json file.

Finally, use the folder icon on the left of this notebook to upload the kaggle.json file to Colab.

In [None]:
# We need to move the kaggle.json file to the expected location 

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
import pandas as pd
import kaggle
import matplotlib.pyplot as plt
import seaborn as sns

### Import data

In [None]:
# Download the dataset from Kaggle
!kaggle datasets download -d berkeleyearth/climate-change-earth-surface-temperature-data

In [None]:
# Unzip the downloaded dataset
!unzip climate-change-earth-surface-temperature-data.zip

In [None]:
# Load the dataset into a Pandas DataFrame
data = pd.read_csv('GlobalLandTemperaturesByCountry.csv')

### Reading and describing data

In [None]:
# Display the first few rows of the dataset
data.head()

In [None]:
# Brief information about the dataset (column types and non-null counts)
data.info()

In [None]:
# Summary statistics
print(data.describe())

In [None]:
# Find the number of countries
num_countries = len(data['Country'].unique())
print(f'There are {num_countries} unique countries in the dataset.')

### Preprocessing

As a first glance at the data already shows, it contains many NaN values, which cannot be used for further analysis. Therefore, we need to remove these values from our dataset.

In [None]:
# Check the number of missing values
print(data.isna().sum())

In [None]:
# Check what percentage of the data is missing
missing_percentage = data.isna().sum() / len(data) * 100
print(f'Percentage of missing values in each column: \n{missing_percentage}')

In [None]:
# Drop every row that contains at least one missing value
data.dropna(inplace = True)

In [None]:
# Check that no missing values remain
print(data.isna().sum())
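
Dropping every row with any NaN is the simplest option, but it can discard more rows than the analysis needs. As a minimal sketch on toy data (the values are invented, and the column name `AverageTemperatureUncertainty` is an assumption about the CSV that this notebook does not otherwise use), the `subset` parameter of `dropna` restricts the check to the column we actually analyze:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    'dt': ['1900-01-01', '1900-02-01', '1900-03-01'],
    'AverageTemperature': [1.5, np.nan, 3.0],
    'AverageTemperatureUncertainty': [0.3, np.nan, np.nan],  # assumed column name
    'Country': ['A', 'A', 'A'],
})

# Drop rows only when the temperature itself is missing
cleaned = toy.dropna(subset=['AverageTemperature'])

# Drop rows with a missing value in any column (the approach used in this notebook)
strict = toy.dropna()

print(len(toy), len(cleaned), len(strict))
```

Both variants leave the `AverageTemperature` column free of NaN values; they differ only in how many rows survive.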

In [None]:
# Check names of the countries - can you see any duplicates or mistakes?
countries = data['Country'].unique()
print(countries)

In [None]:
# replace 'United Kingdom (Europe)' with 'United Kingdom'
data['Country'] = data['Country'].replace('United Kingdom (Europe)', 'United Kingdom')

In [None]:
# Check if the replacement was successful
countries = data['Country'].unique()
print(countries)
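
Scanning the printed array by eye works, but variants such as 'United Kingdom (Europe)' can also be flagged programmatically. The following is a sketch on a hand-written list of names; the assumption that other duplicates follow the same parenthesised pattern is ours and should be checked against the real output above:

```python
import pandas as pd

# Hand-written example names, not the full list from the dataset
names = pd.Series(['United Kingdom', 'United Kingdom (Europe)', 'China'])

# Flag names carrying a parenthesised qualifier such as "(Europe)"
suspect = names[names.str.contains(r'\(', regex=True)]

# Strip the qualifier to see which base name each variant would map to
base = suspect.str.replace(r'\s*\(.*\)$', '', regex=True)

print(pd.DataFrame({'variant': suspect, 'base': base}))
```

Whether a flagged variant should be merged into its base name, kept separate, or dropped is a judgment call that depends on what the two entries represent.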

### Visualizing data

Before completing the intended analysis, it may also be useful to explore the data further.

In [None]:
# Boxplot of the average temperature of UK, China, and USA
subset_data = data[data['Country'].isin(['United Kingdom', 'China', 'United States'])]
plt.figure(figsize=(12, 6))
sns.boxplot(x='Country', y='AverageTemperature', data=subset_data)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Violinplot shows slightly more detail on the distribution of the data
plt.figure(figsize=(12, 6))
sns.violinplot(x='Country', y='AverageTemperature', data=subset_data)
plt.xticks(rotation=90)
plt.show()
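
Task 2 asks for the 20 countries with the highest mean temperature. One possible approach, sketched here on toy data, is to group by country, average, sort, and keep the top `n`. The helper name `top_mean_temperature` is hypothetical and not part of the original notebook:

```python
import pandas as pd

def top_mean_temperature(df, n=20):
    """Return the n countries with the highest mean AverageTemperature."""
    return (
        df.groupby('Country')['AverageTemperature']
        .mean()
        .sort_values(ascending=False)
        .head(n)
    )

# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'AverageTemperature': [10.0, 12.0, 25.0, 27.0, 18.0, 20.0],
})

top2 = top_mean_temperature(toy, n=2)
print(top2)
```

On the real data, calling `top_mean_temperature(data)` should return a Series that can be passed to `plt.bar` in the same way as the bar plots later in this notebook.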

### Largest Temperature Change

After preprocessing and generally characterizing the data, we can now find the countries that underwent the largest temperature change. There are numerous ways in which such a task can be completed; the guided way below is only a suggestion.

In [None]:
# Extracting the names of countries from the dataset
countries = data['Country'].unique()

In [None]:
# Compute the temperature change for each country with a loop
Temp_Data = {}

for country in countries:
    # For each country, select its temperature records in row order
    temp = data[data['Country'] == country]['AverageTemperature']

    # Change = last recorded value minus first recorded value, rounded to 2 decimals
    Temp_Data[country] = round(temp.iloc[-1] - temp.iloc[0], 2)
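
The same first-versus-last comparison can be written without an explicit loop. This is an alternative sketch on toy data, not the notebook's own method; it relies on the rows of each country already being in chronological order:

```python
import pandas as pd

# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'AverageTemperature': [10.0, 11.0, 12.5, 20.0, 19.0, 19.2],
})

# First and last recorded temperature per country, in row order
first_last = toy.groupby('Country')['AverageTemperature'].agg(['first', 'last'])

# Change = last value minus first value, rounded to 2 decimals
change = (first_last['last'] - first_last['first']).round(2)

print(change.sort_values(ascending=False))
```

Note that either version compares two single monthly readings, so the result can reflect the season of the first and last record as much as any long-term change; comparing yearly means would be one way to reduce that effect.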

In [None]:
# Converting to dataframe and sorting in descending order
temp_change = pd.DataFrame(Temp_Data, index = [0]).T
highest_change = temp_change.sort_values([0], ascending = False)[:20]

In [None]:
# Generating bar plot
y = highest_change.values[:, 0]
x = highest_change.index
plt.bar(x, y)
plt.xticks(rotation = 90)
plt.xlabel('Country', fontsize = 12)
plt.ylabel('Temperature Change', fontsize = 12)
plt.title('Countries with the largest temperature change', fontsize = 13)
plt.show()

### Overall Tendencies

There are a few ways in which we could determine the overall tendencies in the global surface temperature - we could analyze the temperature change per country or we could observe the change in the average global temperature. Let's look at both cases.

In [None]:
# Plot the bar graph for all countries

highest_change = temp_change.sort_values([0], ascending = False)
y = highest_change.values[:, 0]
x = highest_change.index
plt.bar(x, y)
plt.xticks(visible = False)
plt.ylabel('Temperature Change', fontsize = 12)
plt.title('Temperature Change Distribution', fontsize = 13)
plt.show()

In [None]:
# Average temperature over time
data['dt'] = pd.to_datetime(data['dt'])
data['year'] = data['dt'].dt.year
average_temp_by_year = data.groupby('year')['AverageTemperature'].mean()

plt.figure(figsize=(12, 6))
average_temp_by_year.plot()
plt.title('Average Temperature Over Time')
plt.xlabel('Year')
plt.ylabel('Average Temperature (°C)')
plt.grid(True)
plt.show()
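
To put a number on the tendency shown in the line plot, one option is to fit a straight line to the yearly means. The sketch below uses invented yearly values and `np.polyfit`; a linear fit is our simplifying assumption and may not describe the real series well over its full range:

```python
import numpy as np
import pandas as pd

# Toy yearly means for illustration only (not taken from the real dataset)
yearly = pd.Series(
    [14.0, 14.1, 14.2, 14.3, 14.4],
    index=[2000, 2001, 2002, 2003, 2004],
)

# Least-squares straight line; the slope is in degrees Celsius per year
slope, intercept = np.polyfit(yearly.index.to_numpy(), yearly.to_numpy(), deg=1)

print(f'Trend: {slope:.3f} °C per year')
```

On the real data, `average_temp_by_year` from the cell above could take the place of `yearly`. A positive slope would indicate warming over the fitted period.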