<H5>Exploring the data sets related to the project.


In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns   
import matplotlib.pyplot as plt  
import os

Define the column names.
<br>Read the data, replacing the file's header row with these names.

In [2]:
column_names = ['Date', 'Location', 'Country', 'Temperature', 'CO2_Emissions', 'Sea_Level_Rise', 'Precipitation', 'Humidity', 'Wind_Speed']

climate = pd.read_csv('climate_change_data.csv', names=column_names, header=0)
climate.head(5)

Unnamed: 0,Date,Location,Country,Temperature,CO2_Emissions,Sea_Level_Rise,Precipitation,Humidity,Wind_Speed
0,2000-01-01 00:00:00.000000000,New Williamtown,Latvia,10.688986,403.118903,0.717506,13.835237,23.631256,18.492026
1,2000-01-01 20:09:43.258325832,North Rachel,South Africa,13.81443,396.663499,1.205715,40.974084,43.982946,34.2493
2,2000-01-02 16:19:26.516651665,West Williamland,French Guiana,27.323718,451.553155,-0.160783,42.697931,96.6526,34.124261
3,2000-01-03 12:29:09.774977497,South David,Vietnam,12.309581,422.404983,-0.475931,5.193341,47.467938,8.554563
4,2000-01-04 08:38:53.033303330,New Scottburgh,Moldova,13.210885,410.472999,1.135757,78.69528,61.789672,8.001164


<H1>Exploratory Data Analysis

In [3]:
climate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            10000 non-null  object 
 1   Location        10000 non-null  object 
 2   Country         10000 non-null  object 
 3   Temperature     10000 non-null  float64
 4   CO2_Emissions   10000 non-null  float64
 5   Sea_Level_Rise  10000 non-null  float64
 6   Precipitation   10000 non-null  float64
 7   Humidity        10000 non-null  float64
 8   Wind_Speed      10000 non-null  float64
dtypes: float64(6), object(3)
memory usage: 703.3+ KB
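
The `info()` output above shows `Date` stored as `object`, meaning the timestamps are still plain strings. A minimal sketch of converting them with `pd.to_datetime` follows. The `sample` frame is a hypothetical stand-in built from the first two rows shown earlier, not the full data set.

```python
import pandas as pd

# Hypothetical two-row stand-in for the climate DataFrame (values copied from head()).
sample = pd.DataFrame({
    "Date": ["2000-01-01 00:00:00.000000000",
             "2000-01-01 20:09:43.258325832"],
    "Temperature": [10.688986, 13.81443],
})

# Parse the strings into datetime64 values so the .dt accessor and date arithmetic work.
sample["Date"] = pd.to_datetime(sample["Date"])

print(sample.dtypes)
print(sample["Date"].dt.year.tolist())
```

Applying the same call to `climate['Date']` should give a datetime column, assuming every row follows the timestamp format seen in the preview.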


In [4]:
climate.describe()

Unnamed: 0,Temperature,CO2_Emissions,Sea_Level_Rise,Precipitation,Humidity,Wind_Speed
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,14.936034,400.220469,-0.003152,49.881208,49.771302,25.082066
std,5.030616,49.696933,0.991349,28.862417,28.92932,14.466648
min,-3.803589,182.13122,-4.092155,0.010143,0.018998,0.001732
25%,11.577991,367.10933,-0.673809,24.497516,24.71325,12.539733
50%,14.981136,400.821324,0.002332,49.818967,49.678412,24.910787
75%,18.305826,433.307905,0.675723,74.524991,75.20639,37.67026
max,33.976956,582.899701,4.116559,99.9919,99.959665,49.997664


How many countries are represented? How many times does each country appear?

In [5]:
unique_country = climate['Country'].nunique()
print(f'Number of unique countries: {unique_country}')

country_count = climate['Country'].value_counts()
print(f'Count of each country:\n{country_count}')

Number of unique countries: 243
Count of each country:
Country
Congo                        94
Korea                        76
Tanzania                     61
Armenia                      58
French Guiana                58
                             ..
Saint Pierre and Miquelon    28
Chile                        28
Dominican Republic           28
Syrian Arab Republic         27
Saint Kitts and Nevis        23
Name: count, Length: 243, dtype: int64


<h1>Find outliers
<br>
<h2>Interquartile Range
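<br>Focus on the middle 50% of the data to minimize the influence of extreme values. Values more than 1.5 × IQR below the first quartile or above the third quartile are treated as outliers.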

In [6]:
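def identify_outliers(column):
    """Return a boolean Series that is True where a value lies within the IQR bounds.

    Despite the name, True marks values to keep (non-outliers).
    """
    Q1 = column.quantile(0.25)
    Q3 = column.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Values inside [lower_bound, upper_bound], endpoints included, are NOT outliers
    return column.between(lower_bound, upper_bound, inclusive='both')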
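
To make the 1.5 × IQR rule concrete, here is a small worked sketch on a toy Series. The helper `iqr_bounds` and the `toy` data are hypothetical and exist only for illustration. They mirror the arithmetic of the function above but are not part of the notebook.

```python
import pandas as pd

def iqr_bounds(column, k=1.5):
    """Return the (lower, upper) fences for a numeric Series (hypothetical helper)."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data: five ordinary values and one extreme value.
toy = pd.Series([1, 2, 3, 4, 5, 100])

lower, upper = iqr_bounds(toy)
keep = toy.between(lower, upper, inclusive="both")  # True = not an outlier

print(lower, upper)          # -1.5 8.5
print(toy[~keep].tolist())   # [100]
```

Here Q1 = 2.25 and Q3 = 4.75, so the IQR is 2.5 and the fences sit at -1.5 and 8.5. Only the value 100 falls outside them.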

In [7]:
columns_to_check = ['Temperature', 'CO2_Emissions', 'Sea_Level_Rise', 'Precipitation', 'Humidity', 'Wind_Speed']
# Build a row mask that is True only when every checked column lies within its IQR bounds
mask = pd.concat([identify_outliers(climate[col]) for col in columns_to_check], axis=1).all(axis=1)

# Filter the DataFrame to only include rows where all specified columns are NOT outliers
df_filtered = climate[mask]

print(df_filtered)


                               Date           Location                Country  \
0     2000-01-01 00:00:00.000000000    New Williamtown                 Latvia   
1     2000-01-01 20:09:43.258325832       North Rachel           South Africa   
2     2000-01-02 16:19:26.516651665   West Williamland          French Guiana   
3     2000-01-03 12:29:09.774977497        South David                Vietnam   
4     2000-01-04 08:38:53.033303330     New Scottburgh                Moldova   
...                             ...                ...                    ...   
9995  2022-12-27 15:21:06.966696576   South Elaineberg                 Bhutan   
9996  2022-12-28 11:30:50.225022464       Leblancville                  Congo   
9997  2022-12-29 07:40:33.483348224     West Stephanie              Argentina   
9998  2022-12-30 03:50:16.741674112        Port Steven                Albania   
9999  2022-12-31 00:00:00.000000000  West Anthonyburgh  Sao Tome and Principe   

      Temperature  CO2_Emis

In [8]:
df_filtered.describe()

Unnamed: 0,Temperature,CO2_Emissions,Sea_Level_Rise,Precipitation,Humidity,Wind_Speed
count,9777.0,9777.0,9777.0,9777.0,9777.0,9777.0
mean,14.92646,400.408051,0.003025,49.947072,49.809217,25.11429
std,4.880705,48.044085,0.962848,28.870203,28.925483,14.469197
min,1.486266,268.16435,-2.695787,0.010143,0.018998,0.001732
25%,11.588366,367.584788,-0.667428,24.577947,24.773542,12.599789
50%,14.976462,400.882208,0.00635,49.853747,49.689499,24.990789
75%,18.279137,433.068693,0.672493,74.607422,75.290184,37.676352
max,28.330762,532.556055,2.698498,99.9919,99.959665,49.997664
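
The filtered frame keeps 9,777 of the 10,000 rows, so 223 rows had at least one flagged value. A possible way to see which columns account for the removals is to count flagged values per column. The sketch below does this on a hypothetical `toy` frame with a vectorized variant of the same IQR rule. It is an illustration, not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric climate columns.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 100],      # 100 lies far above the upper fence
    "b": [10, 11, 12, 13, 14, 15],  # no value outside the fences
})

# Per-column quartiles and IQR, computed for all columns at once.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Compare each column against its own fences (bounds align on column names).
outside = (toy.lt(q1 - 1.5 * iqr, axis="columns")
           | toy.gt(q3 + 1.5 * iqr, axis="columns"))

flagged_per_column = outside.sum()                  # flagged values in each column
rows_dropped = int(outside.any(axis=1).sum())       # rows with at least one flag

print(flagged_per_column.to_dict())
print(rows_dropped)
```

Because a single row can be flagged in more than one column, the per-column counts may sum to more than the number of dropped rows.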


Calculate Averages by Country

In [9]:
avg_by_country = df_filtered.groupby('Country')[['Temperature', 'CO2_Emissions', 'Sea_Level_Rise', 'Precipitation', 'Humidity', 'Wind_Speed']].mean()
avg_by_country = avg_by_country.reset_index()
print(avg_by_country) 


               Country  Temperature  CO2_Emissions  Sea_Level_Rise  \
0          Afghanistan    14.723790     398.341576        0.073322   
1              Albania    15.646637     403.086827       -0.251437   
2              Algeria    14.452264     401.268141       -0.286792   
3       American Samoa    15.895872     392.449374        0.156509   
4              Andorra    15.074974     410.238930       -0.030364   
..                 ...          ...            ...             ...   
238  Wallis and Futuna    15.236471     390.350860        0.033463   
239     Western Sahara    15.354093     385.040054       -0.048347   
240              Yemen    14.937284     395.964733        0.166413   
241             Zambia    14.753032     398.803052       -0.222091   
242           Zimbabwe    14.817535     413.141296       -0.092796   

     Precipitation   Humidity  Wind_Speed  
0        47.833721  50.510407   23.679759  
1        56.265142  54.208546   25.883135  
2        45.423037  48.3603
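
The country counts printed earlier range from 23 to 94 rows before filtering, so each average rests on a different sample size. One way to keep that visible is to report the row count next to the mean. The following is a minimal sketch using pandas named aggregation on a hypothetical `toy` frame. The output column names (`mean_temperature`, `mean_co2`, `n_rows`) are made up for the example.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the climate frame.
toy = pd.DataFrame({
    "Country": ["Latvia", "Latvia", "Vietnam"],
    "Temperature": [10.0, 14.0, 12.0],
    "CO2_Emissions": [400.0, 410.0, 420.0],
})

# as_index=False keeps Country as a regular column, so no reset_index() is needed.
summary = (
    toy.groupby("Country", as_index=False)
       .agg(mean_temperature=("Temperature", "mean"),
            mean_co2=("CO2_Emissions", "mean"),
            n_rows=("Temperature", "size"))
)

print(summary)
```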

In [10]:
# Sort by country and date
df_sorted = df_filtered.sort_values(by=['Country', 'Date'])

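# Calculate the change in Sea_Level_Rise between consecutive observations within each country
# (the first row of each country has no predecessor, so its change is NaN)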
df_sorted['Sea_Level_Change'] = df_sorted.groupby('Country')['Sea_Level_Rise'].diff()
print(df_sorted)


                               Date              Location      Country  \
96    2000-03-21 15:33:12.799279928           Port Donald  Afghanistan   
172   2000-05-24 11:52:00.432043204             Kaylaberg  Afghanistan   
582   2001-05-03 22:17:36.345634560  West Jenniferborough  Afghanistan   
689   2001-08-01 19:37:44.986498648           Samuelville  Afghanistan   
1059  2002-06-08 15:34:30.567056704          Jenniferland  Afghanistan   
...                             ...                   ...          ...   
8230  2018-12-05 21:23:36.021602176             Jasonport     Zimbabwe   
9050  2020-10-24 18:14:47.848784896              Weisston     Zimbabwe   
9137  2021-01-05 20:20:31.323132288       North Shawntown     Zimbabwe   
9516  2021-11-20 05:44:46.228622848    South Jenniferfort     Zimbabwe   
9613  2022-02-09 17:27:42.286228608         Veronicahaven     Zimbabwe   

      Temperature  CO2_Emissions  Sea_Level_Rise  Precipitation   Humidity  \
96      22.174142     476.686303 
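
To illustrate what the grouped `diff()` produces, here is a small sketch on a hypothetical `toy` frame. The values are invented for readability and are not taken from the data set.

```python
import pandas as pd

# Hypothetical toy data: three observations for country A, one for country B.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2000-03-01", "2000-01-01", "2000-02-01", "2000-01-15"]),
    "Country": ["A", "A", "B", "A"],
    "Sea_Level_Rise": [1.0, 0.25, 2.0, 0.5],
})

# Sort so that each country's rows are in chronological order.
toy_sorted = toy.sort_values(by=["Country", "Date"]).copy()

# diff() restarts at each country, so a country's first row gets NaN.
toy_sorted["Sea_Level_Change"] = toy_sorted.groupby("Country")["Sea_Level_Rise"].diff()

print(toy_sorted)
```

Country A yields NaN, 0.25, and 0.5, while country B, with a single row, yields only NaN. Each difference compares one observation with the previous one for the same country, regardless of how much time separates them.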