<a href="https://colab.research.google.com/github/fellowship/platform-demos3/blob/master/InteriorDesignClassification/Interior_Design_MultipleLabels_DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a dataframe for multiple labels 

In [0]:
#Importing fastai libraries
from fastai.vision import *

In [2]:
# Mounting google drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
# Creating a folder in colab memory to store the csv file after copying
!mkdir 'Data'
!cp 'drive/My Drive/scraping_data.csv' 'Data/'

In [6]:
# Reading the first 5 rows of the csv file 
data = pd.read_csv('Data/scraping_data.csv')
data.head()

Unnamed: 0,links,labels
0,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Classic, Traditional Bedroom De..."
1,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Modern, Bohemian, Glam Bedroom ..."
2,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Modern, Bohemian, Glam Bedroom ..."
3,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Farmhouse, Transitional Nursery..."
4,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Online Interior Design And Home..."


**It consists of two columns:**

*   Links : the links for dowloading images 
*   Labels : the labels associated with the images (both single and multi-label)

**This dataframe was created after webscraping the metadata from [Havenly's website](https://havenly.com/interior-design-style-quiz)**

## Data Cleaning

At first we shall split the strings in 'links' column based on " , so that we are left only with website links in the column  

In [0]:
for i in range(len(data['links'])):
  data['links'][i] = data['links'][i].split('"')[1]

We check if a 'NaN' is created or not due to above changes

In [8]:
data['links'].isna().sum()

0

Now we directly remove unnecessary informations from the dataframe.
**First**, the website links in 'links' column which doesn't contains any image. These have 'no-image.png' mentioned in their links. We search them and append those row indices in a list.
**Second**, the labels in 'label' column which doesn't points to any label for its corresponding image. They are simply named as 'Online Interior Design And Home Inspiration | Havenly'. We search them and append their row indices too to the same list.

In [9]:
indices=[]
for i in range(len(data['links'])):
  if data['links'][i].find('no-image.png') > -1:
    indices.append(i)
  if data['labels'][i].find('Online Interior Design And Home Inspiration | Havenly') > -1:
    indices.append(i)
print(len(indices))

3193


3193 such rows!!

We drop these indices directly from the dataframe and assign the dataframe to new one

In [0]:
data_drop =data.drop(indices, axis=0)

As the indices remains as earlier, so we reset the indices to newly created one

In [0]:
data_drop = data_drop.reset_index(drop=True)

**Let's see the new cleaner dataframe**

In [12]:
data_drop.head()

Unnamed: 0,links,labels
0,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Classic, Traditional Bedroom De..."
1,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Modern, Bohemian, Glam Bedroom ..."
2,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Modern, Bohemian, Glam Bedroom ..."
3,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Farmhouse, Transitional Nursery..."
4,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Transitional Living Room Design..."


**Still not over. Lets clean the 'labels' column.** 

Let's first understand the labelling style. 

Ex : meta content="Modern, Bohemian, Glam Bedroom Design by Havenly Interior Designer Katrina" property="og:title

We see inside " ", the data is mentioned in structure **Design Label(s)**--**Type of rooms**--**Designer info**. 
As we notice that several images have multiple labels so we need to keep them in our dataframe. Keeping 'type of rooms' information for our classification challenge demands decision out of  domain knowledge; for now we remove it. Finally the designer's information is completely not required; hence removed. 

Apart from this there are few 'labels' which contains just designer's information, for ex: meta content="Design by Havenly Interior Designer Erin" property="og:title"/ . As they are the shortest in length, we remove them first.

This is the list of all labels provided in the metadata.

In [0]:
all_labels = ['Classic','Modern','Glam','Industrial','Traditional', 'Coastal', 'Global', 'Preppy','Rustic',
                'Transitional','Farmhouse','Bohemian', 'Midcentury','Scandinavian','Eclectic','Minimal']

As mentioned above, we clean our dataframe accordingly

In [0]:
import re
for i in range(len(data_drop['labels'])):
  if len(re.findall(r"[\w']+", data_drop['labels'][i]))<12: 
    data_drop.drop(i, axis=0)  # We drop the designer's info only rows 
  else:
    data_drop['labels'][i] = re.findall(r"[\w']+", data_drop['labels'][i])[2:-9] # we remove the designers info and begin/end words
    data_drop['labels'][i] = [label for label in data_drop['labels'][i] if label in all_labels] # we remove room info by chosing words from labels provided
    data_drop['labels'][i] = ", ".join(str(label) for label in data_drop['labels'][i]) # we change remove the brackets of the lists

We check if a 'NaN' is created or not due to above changes

In [15]:
data_drop.isna().sum() 

links     0
labels    0
dtype: int64

Let's see the final cleaner dataframe

In [16]:
data_drop.head(10)

Unnamed: 0,links,labels
0,https://s3.amazonaws.com/havenly-uploads/prod/...,"Classic, Traditional"
1,https://s3.amazonaws.com/havenly-uploads/prod/...,"Modern, Bohemian, Glam"
2,https://s3.amazonaws.com/havenly-uploads/prod/...,"Modern, Bohemian, Glam"
3,https://s3.amazonaws.com/havenly-uploads/prod/...,"Farmhouse, Transitional"
4,https://s3.amazonaws.com/havenly-uploads/prod/...,Transitional
5,https://s3.amazonaws.com/havenly-uploads/prod/...,"Coastal, Traditional"
6,https://s3.amazonaws.com/havenly-uploads/prod/...,"Farmhouse, Transitional"
7,https://s3.amazonaws.com/havenly-uploads/prod/...,Glam
8,https://s3.amazonaws.com/havenly-uploads/prod/...,Glam
9,https://s3.amazonaws.com/havenly-uploads/prod/...,"Classic, Eclectic"


The number of rows reduced after further cleaning

In [17]:
len(data),len(data_drop)

(24466, 21273)

The unique grouping of multiple labels reduced significantly

In [18]:
data['labels'].nunique(),data_drop['labels'].nunique()

(9570, 773)

## Export the dataframe to drive

In [0]:
data_drop.to_csv('drive/My Drive/ready_data.csv', index=False)