# Weather prediction


## Part 1 - Data


### Problem to solve

The problem I want to solve is predicting weather conditions, which I treat as a regression problem. The goal is to forecast the coming week's weather, with a focus on temperature. To do this, I will use a CSV file that contains several parameters, including temperature, along with the weather condition recorded for each day. The dataset will be split into three parts:

- Training data.
- Validation data.
- Test data.
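
One way the three-way split could look is sketched below. It is only a sketch under assumptions: the 70/15/15 proportions, the helper name `chronological_split`, and the toy frame with a made-up `temperature` column are illustrative and not taken from the dataset. It splits in date order rather than at random, a common choice for forecasting so that the model is not evaluated on days earlier than the ones it was trained on.

```python
import pandas as pd


def chronological_split(frame, train_frac=0.7, val_frac=0.15, date_col="date"):
    """Split a frame into train/validation/test parts in date order.

    The fractions are assumptions for illustration, not values from the project.
    """
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_rows = len(ordered)
    train_end = int(n_rows * train_frac)
    val_end = train_end + int(n_rows * val_frac)
    # Whatever remains after the training and validation rows becomes the test set.
    return (
        ordered.iloc[:train_end],
        ordered.iloc[train_end:val_end],
        ordered.iloc[val_end:],
    )


# Toy data, invented purely to demonstrate the helper.
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=20, freq="D"),
    "temperature": range(20),
})

train_part, val_part, test_part = chronological_split(toy)
print(len(train_part), len(val_part), len(test_part))
```

Because the rows are sorted before slicing, every validation day comes after every training day, and every test day comes after every validation day.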


### Data Source

The data for this prediction comes from a CSV file downloaded from Kaggle, a platform that hosts datasets for many different purposes. For each dataset, users can see how much data it contains and how popular it is, as indicated by its number of votes. Because anyone can contribute data to Kaggle, the nature and quality of a dataset should be checked carefully before use. I chose this dataset because of its large number of upvotes, which suggests that many users have used it and found it useful.

The URL for my specific dataset is: [https://www.kaggle.com/datasets/ananthr1/weather-prediction](https://www.kaggle.com/datasets/ananthr1/weather-prediction)


### The imports

These are the imports needed to run this weather prediction:


In [None]:
import pandas as pd
import os

Make sure pandas is installed in your Python environment before running the code. The os module is part of the Python standard library and needs no separate installation.


### Load data

The first step is to load the data from the CSV file in the "Data" folder. We will store it in a variable named "data".


In [None]:
data_file_path = os.path.abspath("./Data/seattle-weather.csv")
data = pd.read_csv(data_file_path)

### Info about the data

Once the data is loaded, I provide an overview of its columns, displaying the following information:

- Column names.
- The count of null and non-null values in each column.
- The data type of each column, such as string or float.


#### Description of each column

- **Column name**
  <br>
  Provides the name of the column.

* **Non-null count**
  <br>
  Indicates the number of rows with a value in the column.

* **Null count**
  <br>
  Specifies the number of rows without any value in the column.

* **Data type**
  <br>
  Shows the data type pandas assigned to the column.


In [None]:
data_columns_info = pd.DataFrame({
    'Non-Null Count': data.count(),
    'Null Count': data.isnull().sum(),
    'DataType': data.dtypes
}).reset_index().rename(columns={'index': 'Column Name'})

data_columns_info

#### Raw data

This is an example view of the raw, unprocessed data.


In [None]:
data

#### Key values about the data

This view presents key statistical values for the numeric columns of the dataset.
The following metrics are included:

- **Count**
  <br>
  Represents the number of non-null values in the column.

* **Mean**
  <br>
  Indicates the average value for the column.

* **Std**
  <br>
  Denotes the standard deviation for the column.

* **Min**
  <br>
  Represents the minimum value in the column.

* **25%**
  <br>
  The 25th percentile, also known as the first quartile, indicates the value below which 25% of the data falls.

* **50%**
  <br>
  The 50th percentile, or the median, represents the middle value of the dataset when it is sorted in ascending order.

* **75%**
  <br>
  The 75th percentile, or the third quartile, indicates the value below which 75% of the data falls.

* **Max**
  <br>
  Represents the maximum value in the column.
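
To make the link between these metrics and the underlying calculations concrete, here is a small sketch on invented numbers (the values are not from the weather dataset). It assumes the default behaviour of `describe()`, where the percentile rows match `Series.quantile` and missing values are left out of the count.

```python
import pandas as pd

# Invented values for illustration only; one entry is missing on purpose.
values = pd.Series([2.0, 4.0, None, 6.0, 8.0, 10.0])

summary = values.describe()

# The percentile rows of describe() correspond to quantile() calls.
first_quartile = values.quantile(0.25)
median = values.quantile(0.50)
third_quartile = values.quantile(0.75)

# Missing values are skipped, so the count is lower than len(values).
non_null_count = values.count()

print(summary)
print(first_quartile, median, third_quartile, non_null_count)
```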


In [None]:
data.describe().reset_index().rename(columns={'index': 'Key'})

### Cleaning the data

As seen above, two columns are of type 'object'. Inspecting the raw data shows that they do not need to be: one holds dates and the other holds text. I therefore convert the date column to the 'datetime64[ns]' type and the weather column to the 'string' type.

The reason for this is to make later stages of the process easier, since explicit types make it simpler to inspect and handle the data in each column.


In [None]:
data['date'] = data['date'].astype("datetime64[ns]")
data['weather'] = data['weather'].astype("string")
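
As an alternative sketch, the same conversion can be written with `pd.to_datetime`, which also makes it possible to detect values that are not valid dates. The toy frame below is invented for illustration, including the deliberately broken date. Using `errors="coerce"` is an assumption about how bad values should be handled, not something the dataset requires.

```python
import pandas as pd

# Invented rows for illustration; the last date is deliberately invalid.
toy = pd.DataFrame({
    "date": ["2012-01-01", "2012-01-02", "not a date"],
    "weather": ["rain", "sun", "rain"],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")
toy["weather"] = toy["weather"].astype("string")

# Count the rows whose date could not be parsed.
unparsed_rows = int(toy["date"].isna().sum())

print(toy.dtypes)
print("Rows with unparseable dates:", unparsed_rows)
```

With `astype("datetime64[ns]")`, as used in the cell above, an invalid date raises an error instead, which is a reasonable choice when the data is expected to be clean.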

#### Raw data (After the cleaning part)

The data looks the same as before, which confirms that changing the data types did not alter the values themselves.


In [None]:
data

#### Description of each column (After the cleaning part)

Here we can see that the data type conversion succeeded: the affected columns now have the intended data types.

- **Column name**
  <br>
  Provides the name of the column.

* **Non-null count**
  <br>
  Indicates the number of rows with a value in the column.

* **Null count**
  <br>
  Specifies the number of rows without any value in the column.

* **Data type**
  <br>
  Shows the data type pandas assigned to the column.


In [None]:
data_columns_info = pd.DataFrame({
    'Non-Null Count': data.count(),
    'Null Count': data.isnull().sum(),
    'DataType': data.dtypes
}).reset_index().rename(columns={'index': 'Column Name'})

data_columns_info
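
Since this column summary is built twice, it could be wrapped in a small helper. The sketch below uses a hypothetical function name, `summarize_columns`, and an invented toy frame. The body is the same expression as in the cells above.

```python
import pandas as pd


def summarize_columns(frame):
    """Return one row per column with its non-null count, null count, and dtype."""
    return (
        pd.DataFrame({
            "Non-Null Count": frame.count(),
            "Null Count": frame.isnull().sum(),
            "DataType": frame.dtypes,
        })
        .reset_index()
        .rename(columns={"index": "Column Name"})
    )


# Invented frame for illustration; each column has one missing value.
toy = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})
column_summary = summarize_columns(toy)
print(column_summary)
```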

## Part 2 - Not started yet
