<a href="https://colab.research.google.com/github/carlos-alves-one/-BDA-CW1/blob/main/weather_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Goldsmiths University of London
### MSc. Data Science and Artificial Intelligence
### Module: Big Data Analysis
### Author: Carlos Manuel De Oliveira Alves
### Student: cdeol003
### Coursework Project

# Define Goal and Tasks

The goal of this project is to analyze weather data from a CSV file.

For each task, we outline pseudo-code for the mapper and reducer functions,
using the MapReduce computational model on a Hadoop cluster.
We then provide a Python implementation of each piece of pseudo-code.

The tasks are:

1. Finding the difference between the maximum and minimum wind speed for each day.

2. Finding the daily minimum relative humidity.

3. Calculating the daily mean and variance of the dew point temperature.

4. Generating a correlation matrix for the month among relative humidity, wind speed,
   and dry bulb temperature.
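
As a reference point for the four tasks, the target quantities can be sketched with pandas on a tiny, made-up table. This is only an illustration: the column names (`date`, `wind_speed`, `rel_humidity`, `dew_point`, `dry_bulb`) and all values are hypothetical, and the use of the population variance (`ddof=0`) and of the Pearson correlation are assumptions, since the task list does not say which variants are required.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative, not taken from the real file.
toy = pd.DataFrame({
    "date": ["20070401"] * 3 + ["20070402"] * 3,
    "wind_speed": [3.0, 7.0, 5.0, 0.0, 4.0, 8.0],
    "rel_humidity": [60.0, 55.0, 70.0, 80.0, 65.0, 75.0],
    "dew_point": [23.0, 25.0, 27.0, 30.0, 30.0, 33.0],
    "dry_bulb": [32.0, 34.0, 36.0, 40.0, 41.0, 45.0],
})

daily = toy.groupby("date")

# Task 1: daily wind speed range (maximum minus minimum)
wind_range = daily["wind_speed"].max() - daily["wind_speed"].min()

# Task 2: daily minimum relative humidity
min_humidity = daily["rel_humidity"].min()

# Task 3: daily mean and variance of the dew point
# (assumption: population variance, hence ddof=0)
dew_mean = daily["dew_point"].mean()
dew_var = daily["dew_point"].var(ddof=0)

# Task 4: correlation matrix over all rows of the month
# (assumption: Pearson correlation, the pandas default)
corr = toy[["rel_humidity", "wind_speed", "dry_bulb"]].corr()

print(wind_range, min_humidity, dew_mean, dew_var, corr, sep="\n\n")
```

Results computed this way on a small sample can serve as a cross-check for the output of the corresponding mapper and reducer functions.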

# Load the data

In [4]:
# Imports the 'drive' module from 'google.colab' and mounts the Google Drive to
# the '/content/drive' directory in the Colab environment.
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [13]:
# Import the pandas library and give it the alias 'pd' for data manipulation and analysis
import pandas as pd

# Load the dataset with the weather data for April from Google Drive
data_path = '/content/drive/MyDrive/weather_project/200704hourly.txt'

# Attempt to read the file, skipping problematic lines
data = pd.read_csv(data_path, on_bad_lines='skip')

# Display the first few rows of the dataframe
data.head(5).T




| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Wban Number | 3011 | 3011 | 3011 | 3011 | 3011 |
| YearMonthDay | 20070401 | 20070401 | 20070401 | 20070401 | 20070401 |
| Time | 50 | 150 | 250 | 350 | 450 |
| Station Type | AO2 | AO2 | AO2 | AO2 | AO2 |
| Maintenance Indicator | - | - | - | - | - |
| Sky Conditions | SCT055 | BKN055 | OVC050 | OVC050 | BKN050 |
| Visibility | 10SM | 10SM | 10SM | 10SM | 10SM |
| Weather Type | - | - | - | - | - |
| Dry Bulb Temp | 32 | 32 | 32 | 34 | 34 |
| Dew Point Temp | 23 | 23 | 23 | 23 | 23 |


# Question No.1

> Find the descriptive statistics for temperature of each day of a given month for the year 2007

**Pseudo-code for the mapper function for Task No.1**


In [3]:
"""
function mapper1(key, value):
      parse value to get the date and wind speed
      emit (date, wind speed) as the key-value pair
"""




- **`function mapper1(key, value):`**
  - This line defines a function named `mapper1`. The function takes two parameters: `key` and `value`. In the context of MapReduce, each input is typically a key-value pair. The `key` might represent some identifier (often unused in the map step), and the `value` represents the data to be processed.

- **`parse value to get the date and wind speed`**
  - This instruction indicates that the function will process the `value` to extract two specific pieces of information: the date and the wind speed. The parsing method depends on the format of the input data. For example, if the value is a string in the format "YYYY-MM-DD, wind_speed", parsing would involve splitting the string by a delimiter (like a comma) and extracting the relevant parts.

- **`emit (date, wind speed) as the key-value pair`**
  - After parsing the value to extract the date and wind speed, the function "emits" or outputs a new key-value pair. In this context, the new key is the date, and the new value is the wind speed. The emit operation is a fundamental part of the MapReduce model, where each mapper function outputs zero or more key-value pairs, which are then processed by reducer functions.

The purpose of this mapper is to turn raw records into (date, wind speed) pairs. Because the date is the key, a reducer receives all wind speed readings for one date together and can compute the daily maximum, the daily minimum, and their difference, as Task No.1 requires. The Python implementation below keys on the station (WBAN number) and the date together, so readings from different stations are not mixed.
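
The parsing and emitting steps described above can be sketched on plain text records. This is a minimal illustration, assuming the "YYYY-MM-DD, wind_speed" format used as an example in the explanation; the names `mapper1_line` and `shuffle` and the sample records are hypothetical, and the in-memory grouping only imitates what a Hadoop cluster does between the map and reduce phases.

```python
from collections import defaultdict


def mapper1_line(key, value):
    """Parse one 'YYYY-MM-DD, wind_speed' record and yield (date, wind_speed).

    Hypothetical helper: 'key' is ignored, as is common in the map step.
    """
    parts = value.split(",")
    if len(parts) != 2:
        return  # malformed record: emit nothing
    date = parts[0].strip()
    try:
        wind_speed = float(parts[1])
    except ValueError:
        return  # non-numeric wind speed: emit nothing
    yield date, wind_speed


def shuffle(pairs):
    """Group emitted values by key, imitating the shuffle phase in memory."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return dict(grouped)


# Made-up sample records; the third one has a non-numeric wind speed.
records = ["2007-04-01, 3", "2007-04-01, 7", "2007-04-02, M", "2007-04-02, 5"]

# Each mapper call emits zero or one pair, so the outputs are flattened.
pairs = [pair for offset, line in enumerate(records)
         for pair in mapper1_line(offset, line)]
grouped = shuffle(pairs)

# With all readings for a date in one list, the daily range is one expression.
ranges = {date: max(speeds) - min(speeds) for date, speeds in grouped.items()}
print(ranges)
```

Writing the mapper as a generator mirrors the "zero or more key-value pairs" behaviour of `emit`: a malformed record simply yields nothing instead of raising an error.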

In [None]:
# Define the mapper function, which takes a DataFrame as an argument
def mapper1(df):

    # Iterate over DataFrame rows to process each row individually
    for index, row in df.iterrows():

        # Extract the WBAN number (Weather Bureau Army Navy) from the first column of the DataFrame.
        # Use .iloc for positional access: row[0] on a label-indexed row is deprecated in pandas.
        wban = row.iloc[0]

        # Extract the date from the second column of the DataFrame
        date = row.iloc[1]

        # Extract the wind speed from the 13th column (index 12) of the DataFrame
        wind_speed = row.iloc[12]

        try:
            # Attempt to convert the wind speed to a floating-point number to ensure it's numeric
            wind_speed = float(wind_speed)

            # If successful, print a composite key (WBAN-Date) and the wind speed, separated by a tab
            print(f"{wban}-{date}\t{wind_speed}")

        except (TypeError, ValueError):
            # If the conversion fails (e.g., the wind speed is not a valid number or is None), skip this row
            continue

# Call the mapper with the DataFrame to map the data
mapper1(data)


Streaming output truncated to the last 5000 lines.
26435-20070414	0.0
26435-20070414	0.0
26435-20070414	0.0
26435-20070414	0.0
26435-20070414	7.0
26435-20070415	6.0
26435-20070415	7.0
26435-20070415	9.0
26435-20070415	7.0
26435-20070415	4.0
26435-20070415	4.0
26435-20070415	8.0
26435-20070415	9.0
26435-20070415	8.0
26435-20070415	6.0
26435-20070415	4.0
26435-20070415	3.0
26435-20070415	6.0
26435-20070415	4.0
26435-20070415	8.0
26435-20070415	4.0
26435-20070415	5.0
26435-20070415	6.0
26435-20070415	9.0
26435-20070415	5.0
26435-20070415	0.0
26435-20070415	0.0
26435-20070415	0.0
26435-20070415	0.0
26435-20070416	0.0
26435-20070416	0.0
26435-20070416	3.0
26435-20070416	3.0
26435-20070416	0.0
26435-20070416	3.0
26435-20070416	0.0
26435-20070416	0.0
26435-20070416	3.0
26435-20070416	0.0
26435-20070416	3.0
26435-20070416	0.0
26435-20070416	4.0
26435-20070416	0.0
26435-20070416	3.0
26435-20070416	4.0
26435-20070416	4.0
26435-20070416	4.0
26435-20070416	0.0
26435-20070416	4.0
2643