# **Getting Started Code**

## Download the images
The '[Data](https://drive.google.com/file/d/1G_Ix97_JP5NhcmMY74ctA2piDxbiU1Lk/view?usp=sharing)' section of the problem page gives a Google Drive link that contains all the required train images (to build the model) and test images (whose labels we predict and then submit on the [DPhi platform](https://drive.google.com/file/d/1G_Ix97_JP5NhcmMY74ctA2piDxbiU1Lk/view?usp=sharing)).

We can use **GoogleDriveDownloader** from the **google_drive_downloader** library in Python to download the files from the shared Google Drive link: https://drive.google.com/file/d/1G_Ix97_JP5NhcmMY74ctA2piDxbiU1Lk/view?usp=sharing




**This link is not working properly, so I have uploaded the data to my own Drive and mounted the Drive to get access to the data.**


My Drive link:

https://drive.google.com/file/d/17B-JQfne4ZX25rubXQocEZLOswedDi6F/view?usp=sharing

The file id in the original shared link is: **1G_Ix97_JP5NhcmMY74ctA2piDxbiU1Lk**. The file id in my Drive link above is: **17B-JQfne4ZX25rubXQocEZLOswedDi6F**
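The file id is the path segment that follows `/file/d/` in a share link. A minimal sketch of pulling it out with a regular expression is shown here; the helper name `extract_drive_file_id` is made up for illustration, and the pattern only covers links of the `/file/d/<id>/...` form used in this notebook.

```python
import re

def extract_drive_file_id(share_url: str) -> str:
    """Return the file id from a Google Drive '/file/d/<id>/...' share link."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"No file id found in: {share_url}")
    return match.group(1)

# The two links that appear in this notebook
original_link = "https://drive.google.com/file/d/1G_Ix97_JP5NhcmMY74ctA2piDxbiU1Lk/view?usp=sharing"
my_link = "https://drive.google.com/file/d/17B-JQfne4ZX25rubXQocEZLOswedDi6F/view?usp=sharing"

original_id = extract_drive_file_id(original_link)
my_id = extract_drive_file_id(my_link)
print(original_id)
print(my_id)
```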

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# from google.colab import files
# uploaded = files.upload()

In [3]:
# !unzip "/content/weather.zip" -d "/content/content/"

All the files from the shared Google Drive link are now accessible in the Colab environment through the mounted Drive.

## Loading Libraries
Not all Python capabilities are loaded into our working environment by default (even if they are already installed on your system). So, we import each library that we want to use.

We chose alias names for our libraries for the sake of convenience (numpy --> np, pandas --> pd and tensorflow --> tf).

Note: You can import all the libraries that you think will be required up front, or import them as you go along.

In [4]:
import pandas as pd                                     # Data analysis and manipulation tool
import numpy as np                                      # Fundamental package for linear algebra and multidimensional arrays
import tensorflow as tf                                 # Deep Learning Tool
import os                                               # OS module in Python provides a way of using operating system dependent functionality
import cv2                                              # Library for image processing
from sklearn.model_selection import train_test_split    # For splitting the data into train and validation set

## Loading and preparing training data
The train and test images are given in two different folders, 'train' and 'test'. The labels of the train images are given in a csv file, 'Training_set.csv', alongside the respective image id (i.e. the image file name).

#### Getting the labels of the images

In [5]:
# labels = pd.read_csv("/content/content/Training_set.csv") # loading the labels
# labels.head() # will display the first five rows in labels dataframe

In [6]:
labels = pd.read_csv("/content/drive/MyDrive/weather/Training_set.csv") # loading the labels
labels.head() # will display the first five rows in labels dataframe

Unnamed: 0,filename,label
0,Image_1.jpg,sunrise
1,Image_2.jpg,shine
2,Image_3.jpg,cloudy
3,Image_4.jpg,shine
4,Image_5.jpg,sunrise


In [7]:
labels.tail()            # will display the last five rows in labels dataframe

Unnamed: 0,filename,label
1043,Image_1044.jpg,foggy
1044,Image_1045.jpg,sunrise
1045,Image_1046.jpg,cloudy
1046,Image_1047.jpg,rainy
1047,Image_1048.jpg,sunrise


#### Getting images file path

In [8]:
# file_paths = [[fname, '/content/content/train/' + fname] for fname in labels['filename']]

In [9]:
file_paths = [[fname, '/content/drive/MyDrive/weather/train/' + fname] for fname in labels['filename']]

#### Confirming if no. of labels is equal to no. of images

In [10]:
# Confirm if number of images is same as number of labels given
if len(labels) == len(file_paths):
    print('Number of labels i.e. ', len(labels), 'matches the number of filenames i.e. ', len(file_paths))
else:
    print('Number of labels does not match the number of filenames')

Number of labels i.e.  1048 matches the number of filenames i.e.  1048
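Because `file_paths` is built from the same `labels['filename']` column, the two lengths above are equal by construction. A stricter check is to confirm that each path points to a file on disk. The sketch below shows one way to do that; `find_missing_files` is a hypothetical helper, and the demo uses a temporary folder with made-up files rather than the real Drive folder.

```python
import os
import tempfile

def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Demo with a throwaway folder: one file exists, one does not
with tempfile.TemporaryDirectory() as folder:
    existing = os.path.join(folder, "Image_1.jpg")
    open(existing, "wb").close()                  # create an empty placeholder file
    absent = os.path.join(folder, "Image_2.jpg")  # never created
    missing = find_missing_files([existing, absent])
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)
```

In this notebook the call would look like `find_missing_files(path for _, path in file_paths)`; an empty result means every listed image was found.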


#### Converting the file_paths to dataframe

In [11]:
images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
images.head()

Unnamed: 0,filename,filepaths
0,Image_1.jpg,/content/drive/MyDrive/weather/train/Image_1.jpg
1,Image_2.jpg,/content/drive/MyDrive/weather/train/Image_2.jpg
2,Image_3.jpg,/content/drive/MyDrive/weather/train/Image_3.jpg
3,Image_4.jpg,/content/drive/MyDrive/weather/train/Image_4.jpg
4,Image_5.jpg,/content/drive/MyDrive/weather/train/Image_5.jpg


#### Combining the labels with the images

In [12]:
train_data = pd.merge(images, labels, how = 'inner', on = 'filename')
train_data.head()       

Unnamed: 0,filename,filepaths,label
0,Image_1.jpg,/content/drive/MyDrive/weather/train/Image_1.jpg,sunrise
1,Image_2.jpg,/content/drive/MyDrive/weather/train/Image_2.jpg,shine
2,Image_3.jpg,/content/drive/MyDrive/weather/train/Image_3.jpg,cloudy
3,Image_4.jpg,/content/drive/MyDrive/weather/train/Image_4.jpg,shine
4,Image_5.jpg,/content/drive/MyDrive/weather/train/Image_5.jpg,sunrise
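An inner merge silently drops rows whose `filename` has no partner and duplicates rows when a key repeats. As a small sketch on toy data (the three rows and the shortened paths below are illustrative, not the real dataset), the `validate` argument of `pd.merge` can guard against repeated keys, and a length comparison confirms that no row was lost.

```python
import pandas as pd

# Toy stand-ins for the 'images' and 'labels' dataframes
images = pd.DataFrame({
    "filename": ["Image_1.jpg", "Image_2.jpg", "Image_3.jpg"],
    "filepaths": ["train/Image_1.jpg", "train/Image_2.jpg", "train/Image_3.jpg"],
})
labels = pd.DataFrame({
    "filename": ["Image_1.jpg", "Image_2.jpg", "Image_3.jpg"],
    "label": ["sunrise", "shine", "cloudy"],
})

# validate="one_to_one" raises pandas.errors.MergeError if a filename repeats on either side
merged = pd.merge(images, labels, how="inner", on="filename", validate="one_to_one")

# An inner merge keeps only matching keys, so equal lengths mean nothing was dropped
rows_kept = len(merged) == len(images)
print(merged)
print(rows_kept)
```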


In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_data['label'] = le.fit_transform(train_data['label'])
print(type(train_data['filepaths']))
print(len(train_data))
print(np.shape(train_data['filepaths'][1]))

<class 'pandas.core.series.Series'>
1048
()


The 'train_data' dataframe contains all the image ids, their locations and their respective labels (now encoded as integers). Now the training data is ready.
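`LabelEncoder` assigns integers to the class names in sorted order, and in the notebook `le.classes_` lists that order. The sketch below reproduces the same mapping with plain NumPy for the five class names that appear in the outputs above; it illustrates the idea and is not scikit-learn's own code.

```python
import numpy as np

# A few labels using the five class names seen in Training_set.csv
labels = np.array(["sunrise", "shine", "cloudy", "shine", "sunrise", "foggy", "rainy"])

# np.unique returns the sorted class names and, for each label, its index in that list
classes, encoded = np.unique(labels, return_inverse=True)

mapping = {name: int(idx) for idx, name in enumerate(classes)}
decoded = classes[encoded]   # indexing back into 'classes' undoes the encoding

print(mapping)
print(encoded)
```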

## Data Pre-processing
All the images must be brought to the same shape and size and converted to their pixel values, because machine learning and deep learning models accept only numerical data. The labels must also be converted from categorical to numerical values, which was done above with LabelEncoder.

In [14]:
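# Data pre-processing
data = []          # initialize an empty list
image_size = 100   # image size taken is 100 here; other sizes can be used too
for i in range(len(train_data)):
    img_array = cv2.imread(train_data['filepaths'][i], cv2.IMREAD_GRAYSCALE)  # read the image in grayscale
    if img_array is None:   # cv2.imread returns None when the file cannot be read
        print('Could not read', train_data['filepaths'][i])
        continue            # skip unreadable files instead of silently stopping
    img = cv2.resize(img_array, (image_size, image_size), interpolation=cv2.INTER_AREA)  # resize to 100 x 100
    data.append([img, train_data['label'][i]])  # keep the pixels and the label together

# number of images that were loaded
len(data)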

1048

In [15]:
data[5]

[array([[129, 130, 131, ..., 129, 128, 127],
        [129, 130, 131, ..., 129, 129, 128],
        [129, 130, 131, ..., 130, 130, 129],
        ...,
        [ 14,  14,  14, ...,  18,  18,  18],
        [ 17,  13,  14, ...,  17,  17,  17],
        [ 24,  11,  14, ...,  16,  17,  17]], dtype=uint8), 0]

#### Shuffle the data

In [16]:
np.random.shuffle(data)

#### Separating the images and labels


In [17]:
x = []
y = []
for image in data:
  x.append(image[0])
  y.append(image[1])

# converting x & y to numpy array as they are list
x = np.array(x)
y = np.array(y)

In [18]:
np.unique(y, return_counts=True)

(array([0, 1, 2, 3, 4]), array([210, 210, 209, 174, 245]))

In [19]:
len(x)

1048

In [20]:
len(y)

1048

#### Splitting the data into Train and Validation Set
We want to check the performance of the model that we build. For this purpose, we split the given data (both the independent and dependent variables) into a training set, which is used to train the model, and a validation set, which is used to check how accurately the model predicts outcomes.

For this purpose we have a function called 'train_test_split' in the 'sklearn.model_selection' module.

In [21]:
x =  x.reshape(-1, 100, 100, 1)
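The reshape adds a trailing channel axis of size 1, because `Conv2D` expects input of shape (height, width, channels). The `-1` lets NumPy infer the number of images. A small sketch on toy data (three blank images, an assumption for illustration):

```python
import numpy as np

image_size = 100
x = np.zeros((3, image_size, image_size), dtype=np.uint8)   # 3 toy grayscale images

# -1 tells NumPy to work out the first dimension (the number of images) itself
x_with_channel = x.reshape(-1, image_size, image_size, 1)

# np.expand_dims gives the same result and names the intent more directly
x_expanded = np.expand_dims(x, axis=-1)

print(x.shape, x_with_channel.shape, x_expanded.shape)
```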

In [22]:
# split the data
X_train, X_val, y_train, y_val = train_test_split(x,y,test_size=0.25, random_state = 42)
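Conceptually, a hold-out split shuffles the row indices once and sets a fraction of them aside. The sketch below shows that idea with NumPy on toy arrays; `holdout_split` is a hypothetical helper and is not how scikit-learn implements `train_test_split`, so the selected rows will differ even with the same seed.

```python
import numpy as np

def holdout_split(x, y, val_fraction=0.25, seed=42):
    """Shuffle the indices once, then hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = int(np.ceil(len(x) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], x[val_idx], y[train_idx], y[val_idx]

# Toy data with the same number of rows as the training set (1048)
toy_x = np.arange(1048).reshape(-1, 1)
toy_y = np.arange(1048) % 5

toy_X_train, toy_X_val, toy_y_train, toy_y_val = holdout_split(toy_x, toy_y)
print(len(toy_X_train), len(toy_X_val))
```

With 1048 images and a 25% hold-out, 786 images are used for training and 262 for validation.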

## Building Model
Now we are finally ready, and we can train the model.

There are many machine learning and deep learning models, such as Random Forest, Decision Tree, Multi-Layer Perceptron (MLP) and Convolutional Neural Network (CNN), to name a few.

Here we use a CNN, and we feed the model both the data (X_train) and the answers for that data (y_train).

In [23]:
cnn = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', input_shape=(100, 100, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),

    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    
    tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
        
    # tf.keras.layers.Flatten(input_shape=(100, 100, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(rate=0.3),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
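To see what reaches the `Flatten` layer, we can trace the spatial size through the three Conv2D + MaxPooling2D pairs by hand. The sketch below assumes the Keras defaults that the cell above relies on (no padding and stride 1 for the convolutions, stride equal to the pool size for pooling); `conv_valid` and `max_pool` are illustrative helper names.

```python
def conv_valid(size, kernel=3):
    """Output width of a stride-1 convolution without padding."""
    return size - kernel + 1

def max_pool(size, pool=2):
    """Output width of a pooling layer whose stride equals the pool size."""
    return size // pool

size = 100
trace = [size]
for _ in range(3):              # three Conv2D + MaxPooling2D pairs
    size = conv_valid(size)
    trace.append(size)
    size = max_pool(size)
    trace.append(size)

# The last convolution has 64 filters, so Flatten sees size * size * 64 values
flattened_features = size * size * 64

print(trace)
print(flattened_features)
```

`cnn.summary()` reports the same shapes layer by layer.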

In [24]:
cnn.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In [25]:
cnn.fit(X_train, y_train, epochs=100, batch_size=15)

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x7fccff5f7d10>

## Validate the model
Wondering 🤔 how well your model learned? Let's check its performance on the validation data (X_val, y_val).

In [26]:
cnn.evaluate(X_val, y_val)



[2.6239144802093506, 0.5916030406951904]
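`cnn.evaluate` returns the loss first and then the metrics passed to `compile`, so the output above is a validation loss of about 2.62 and a validation accuracy of about 0.59. As a sketch of what the two numbers measure, here they are computed by hand with NumPy on four made-up predictions over three classes (toy values, not model output):

```python
import numpy as np

# Toy predicted probabilities for 4 samples over 3 classes
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])
y_true = np.array([1, 0, 1, 1])   # integer labels, as used with the sparse loss

# Accuracy: fraction of samples whose most probable class is the true class
y_pred = probs.argmax(axis=1)
accuracy = float((y_pred == y_true).mean())

# Sparse categorical cross-entropy: mean of -log(probability given to the true class)
true_class_probs = probs[np.arange(len(y_true)), y_true]
loss = float(-np.log(true_class_probs).mean())

print(accuracy, loss)
```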

## Predict The Output For Testing Dataset 😅
We have trained our model and evaluated it, and now we will predict the output/target for the testing data (i.e. the images listed in Testing_set.csv).

#### Load Test Set
Load the test data on which final submission is to be made.

In [27]:
# Loading the order of the image's name that has been provided
test_image_order = pd.read_csv("/content/drive/MyDrive/weather/Testing_set.csv")
test_image_order.head()

Unnamed: 0,filename
0,Image_1.jpg
1,Image_2.jpg
2,Image_3.jpg
3,Image_4.jpg
4,Image_5.jpg


#### Getting images file path

In [28]:
file_paths = [[fname, '/content/drive/MyDrive/weather/test/' + fname] for fname in test_image_order['filename']]

#### Confirm if number of images in test folder is same as number of image names in 'Testing_set.csv'

In [29]:
# Confirm if number of file paths is same as number of image names given
if len(test_image_order) == len(file_paths):
    print('Number of image names i.e. ', len(test_image_order), 'matches the number of file paths i.e. ', len(file_paths))
else:
    print('Number of image names does not match the number of filepaths')

Number of image names i.e.  450 matches the number of file paths i.e.  450


#### Converting the file_paths to dataframe

In [30]:
test_images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
test_images.head()

Unnamed: 0,filename,filepaths
0,Image_1.jpg,/content/drive/MyDrive/weather/test/Image_1.jpg
1,Image_2.jpg,/content/drive/MyDrive/weather/test/Image_2.jpg
2,Image_3.jpg,/content/drive/MyDrive/weather/test/Image_3.jpg
3,Image_4.jpg,/content/drive/MyDrive/weather/test/Image_4.jpg
4,Image_5.jpg,/content/drive/MyDrive/weather/test/Image_5.jpg


## Data Pre-processing on test_data


In [31]:
test_pixel_data = []     # initialize an empty list
image_size = 100         # image size taken is 100 here; other sizes can be used too
for i in range(len(test_images)):
    img_array = cv2.imread(test_images['filepaths'][i], cv2.IMREAD_GRAYSCALE)   # read the image in grayscale
    new_img_array = cv2.resize(img_array, (image_size, image_size))             # resize the image array
    test_pixel_data.append(new_img_array)

In [32]:
test_pixel_data = np.array(test_pixel_data)

In [33]:
test_pixel_data =  test_pixel_data.reshape(-1, 100, 100, 1)

### Make Prediction on Test Dataset
Time to make a submission!!!

In [34]:
pred = cnn.predict(test_pixel_data)

In [35]:
# The predicted values are class probabilities
pred[0]

array([0.00000000e+00, 1.03734865e-35, 1.00000000e+00, 3.78727594e-13,
       0.00000000e+00], dtype=float32)

The above values are probabilities, one per class. We need to convert them into the respective classes, and we can use np.argmax for this.

In [36]:
prediction = []
for value in pred:
  prediction.append(np.argmax(value))
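The loop above can also be written as a single vectorized call, since `np.argmax` accepts an `axis` argument. A short sketch follows; the first row is the `pred[0]` output shown earlier and the second row is made up for illustration.

```python
import numpy as np

pred = np.array([
    [0.0, 1.03734865e-35, 1.0, 3.78727594e-13, 0.0],   # pred[0] from the output above
    [0.6, 0.1, 0.1, 0.1, 0.1],                         # made-up second row
], dtype=np.float32)

# Loop version, as in the cell above
prediction_loop = [int(np.argmax(row)) for row in pred]

# Vectorized version: index of the largest value along each row
prediction_vec = np.argmax(pred, axis=1)

print(prediction_loop)
print(prediction_vec)
```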

In [37]:
predictions = le.inverse_transform(prediction)

## **How to save prediction results locally via jupyter notebook?**
If you are working in a Jupyter notebook, execute the code block below. A file named 'submission.csv' will be created in your current working directory.

In [38]:
res = pd.DataFrame({'filename': test_images['filename'], 'label': predictions})  # predictions holds the model's final class labels for the new, unseen test images
res.to_csv("submission.csv", index = False)      # the csv file will be saved locally on the same location where this notebook is located.
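Before uploading, it can help to read the file back and confirm that it has the expected columns and no empty cells. The sketch below does this on a two-row toy dataframe written to a temporary folder; the column names simply mirror the cell above, and the labels are made up.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the 'res' dataframe
res = pd.DataFrame({
    "filename": ["Image_1.jpg", "Image_2.jpg"],
    "label": ["rainy", "cloudy"],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "submission.csv")
    res.to_csv(path, index=False)     # index=False keeps the row index out of the file
    check = pd.read_csv(path)         # read it back exactly as a reader of the file would

columns_ok = list(check.columns) == ["filename", "label"]
no_missing = not check.isna().any().any()
print(columns_ok, no_missing, len(check))
```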

# **OR,**
**If you are working on Google Colab then use the below set of code to save prediction results locally**

## **How to save prediction results locally via colab notebook?**
If you are working in a Google Colab notebook, execute the code block below. A file named 'submission.csv' will be downloaded to your system.

In [39]:
res = pd.DataFrame({'filename': test_images['filename'], 'label': predictions})  # predictions holds the model's final class labels for the new, unseen test images
res.to_csv("submission.csv", index = False) 

# To download the csv file locally
from google.colab import files        
files.download('submission.csv')


# **Well Done! 👍**
You are all set to make a submission. Let's head to the **[challenge page](https://dphi.tech/challenges/data-sprint-41/142/submit)** to make the submission.