<a href="https://colab.research.google.com/github/fellowship/platform-demos3/blob/master/InteriorDesignClassification/InteriorDesign_MultipleLabels_DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a dataframe for the multi-label Interior Design image dataset

In [0]:
#Importing fastai libraries
from fastai.vision import *

In [2]:
# Mounting google drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [0]:
# Creating a folder in colab memory to store the csv file after copying
!mkdir 'Data'
!cp 'drive/My Drive/scraping_data.csv' 'Data/'

In [4]:
# Reading the csv file and displaying its first 5 rows
data = pd.read_csv('Data/scraping_data.csv')
data.head()

Unnamed: 0,links,labels
0,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Classic, Traditional Bedroom De..."
1,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Modern, Bohemian, Glam Bedroom ..."
2,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Modern, Bohemian, Glam Bedroom ..."
3,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Farmhouse, Transitional Nursery..."
4,"<meta content=""https://s3.amazonaws.com/havenl...","<meta content=""Online Interior Design And Home..."


**It consists of two columns:**

*   Links : the links for downloading the images
*   Labels : the labels associated with the images (both single and multi-label)

**This dataframe was created by scraping the metadata from [Havenly's website](https://havenly.com/interior-design-style-quiz).**

## Data Cleaning

First, we split each string in the 'links' column on the double-quote character (") and keep the second piece, so that only the image URL remains in the column.

In [0]:
# Keep the text between the first pair of double quotes, i.e. the URL
data['links'] = data['links'].str.split('"').str[1]
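
To make the effect of the split concrete, here is a minimal sketch on a single made-up row. The URL and the `property` attribute are placeholders for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample row; the URL and attribute are placeholders for illustration only
sample = pd.DataFrame({
    'links': ['<meta content="https://example.com/img/room.jpg" property="og:image"/>']
})

# Splitting on the double quote yields
# ['<meta content=', 'https://example.com/img/room.jpg', ' property=', 'og:image', '/>'],
# so the URL sits at position 1
sample['links'] = sample['links'].str.split('"').str[1]
print(sample['links'][0])
```

A string without any double quote would have no piece at position 1 and would end up as 'NaN', which is what the check that follows looks for.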

We check whether the changes above introduced any 'NaN' values

In [6]:
data['links'].isna().sum()

0

Now we remove the rows that carry no useful information.
**First**, the links in the 'links' column that don't point to any image. These have 'no-image.png' in their links. We search for them and append their row indices to a list.
**Second**, the entries in the 'labels' column that don't give any label for the corresponding image. They simply read 'Online Interior Design And Home Inspiration | Havenly'. We search for them and append their row indices to the same list.

In [7]:
indices=[]
for i in range(len(data['links'])):
  if data['links'][i].find('no-image.png') > -1:
    indices.append(i)
  if data['labels'][i].find('Online Interior Design And Home Inspiration | Havenly') > -1:
    indices.append(i)
print(len(indices))

3193


3193 matches!! (A row that meets both conditions is appended twice, so the number of distinct rows can be lower.)

We drop these rows from the dataframe and assign the result to a new dataframe

In [0]:
data_drop = data.drop(indices, axis=0)

The remaining rows keep their old index values, so we reset the index

In [0]:
data_drop = data_drop.reset_index(drop=True)
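
The same filtering can also be written with boolean masks instead of an index list. The following is only a sketch on made-up toy data (the rows and the names `toy`, `no_image`, `no_label` and `toy_clean` are illustrative assumptions), not the code that produced the numbers above.

```python
import pandas as pd

# Toy data standing in for the scraped dataframe (values are made up)
toy = pd.DataFrame({
    'links': ['https://example.com/a.jpg',
              'https://example.com/no-image.png',
              'https://example.com/c.jpg'],
    'labels': ['Modern Bedroom Design by Havenly',
               'Glam Bedroom Design by Havenly',
               'Online Interior Design And Home Inspiration | Havenly'],
})

# regex=False matters here: '|' would otherwise be read as a regex alternation
no_image = toy['links'].str.contains('no-image.png', regex=False)
no_label = toy['labels'].str.contains(
    'Online Interior Design And Home Inspiration | Havenly', regex=False)

# Keep the rows that match neither condition and renumber the index
toy_clean = toy[~(no_image | no_label)].reset_index(drop=True)
print(len(toy_clean))
```

With masks, a row that meets both conditions is handled only once, and dropping and re-indexing happen in a single step.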

**Let's see the new cleaner dataframe**

In [10]:
data_drop.head()

Unnamed: 0,links,labels
0,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Classic, Traditional Bedroom De..."
1,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Modern, Bohemian, Glam Bedroom ..."
2,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Modern, Bohemian, Glam Bedroom ..."
3,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Farmhouse, Transitional Nursery..."
4,https://s3.amazonaws.com/havenly-uploads/prod/...,"<meta content=""Transitional Living Room Design..."


**We're not done yet. Let's clean the 'labels' column.**

Let's first understand the labelling style.

Ex : meta content="Modern, Bohemian, Glam Bedroom Design by Havenly Interior Designer Katrina" property="og:title

Inside the quotes, the data follows the structure **Design Label(s)**--**Type of rooms**--**Designer info**.
Several images have multiple labels, so we keep all of them in our dataframe. Whether to keep the 'type of rooms' information for our classification challenge is a decision that calls for domain knowledge; for now we remove it. Finally, the designer's information is not required at all, so it is removed.

Apart from this, a few 'labels' contain only the designer's information, for ex: meta content="Design by Havenly Interior Designer Erin" property="og:title"/ . As they are the shortest in length, we remove them first.

This is the list of all labels provided in the metadata.

In [0]:
all_labels = ['Classic','Modern','Glam','Industrial','Traditional', 'Coastal', 'Global', 'Preppy','Rustic',
                'Transitional','Farmhouse','Bohemian', 'Midcentury','Scandinavian','Eclectic','Minimal']

As mentioned above, we clean our dataframe accordingly

In [0]:
import re
word_pattern = re.compile(r"[\w']+")
for i in range(len(data_drop)):
  words = word_pattern.findall(data_drop.at[i, 'labels'])
  if len(words) < 12:
    data_drop = data_drop.drop(i, axis=0)  # We drop the rows that hold only the designer's info
  else:
    # words[2:-9] removes the leading 'meta content' and the trailing designer info and 'property og title'
    styles = [word for word in words[2:-9] if word in all_labels]  # we remove room info by choosing words from the labels provided
    data_drop.at[i, 'labels'] = ", ".join(styles)  # we store the labels as a comma-separated string instead of a list
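
As an alternative to slicing by word position, the parsing can be sketched as a small function that reads the `content` attribute and cuts it at 'Design by'. The function name `extract_styles` and its `known_labels` parameter are hypothetical, and the sketch assumes every title contains the phrase 'Design by' before the designer's name, as in the examples above.

```python
import re

all_labels = ['Classic', 'Modern', 'Glam', 'Industrial', 'Traditional', 'Coastal', 'Global',
              'Preppy', 'Rustic', 'Transitional', 'Farmhouse', 'Bohemian', 'Midcentury',
              'Scandinavian', 'Eclectic', 'Minimal']

def extract_styles(meta, known_labels=all_labels):
    """Return the style labels found in a meta tag as a comma-separated string ('' if none)."""
    # Take only the text inside content="..."; fall back to the whole string if it is absent
    match = re.search(r'content="([^"]*)"', meta)
    content = match.group(1) if match else meta
    # Assumption: the designer part always starts with 'Design by'; drop it and what follows
    content = content.split('Design by')[0]
    words = re.findall(r"[\w']+", content)
    # Keeping only known style names also removes the room type (e.g. 'Bedroom')
    return ", ".join(word for word in words if word in known_labels)

example = ('<meta content="Modern, Bohemian, Glam Bedroom Design by Havenly '
           'Interior Designer Katrina" property="og:title"/>')
print(extract_styles(example))
```

A title that holds only the designer's information gives an empty string, so such rows can be filtered out afterwards instead of being detected by their word count.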

In [0]:
data_drop = data_drop.dropna()

In [0]:
data_drop = data_drop.reset_index(drop=True)

We check again whether the changes above introduced any 'NaN' values

In [15]:
data_drop.isna().sum()

links     0
labels    0
dtype: int64

Let's see the final cleaner dataframe

In [16]:
data_drop.head(10)

Unnamed: 0,links,labels
0,https://s3.amazonaws.com/havenly-uploads/prod/...,"Classic, Traditional"
1,https://s3.amazonaws.com/havenly-uploads/prod/...,"Modern, Bohemian, Glam"
2,https://s3.amazonaws.com/havenly-uploads/prod/...,"Modern, Bohemian, Glam"
3,https://s3.amazonaws.com/havenly-uploads/prod/...,"Farmhouse, Transitional"
4,https://s3.amazonaws.com/havenly-uploads/prod/...,Transitional
5,https://s3.amazonaws.com/havenly-uploads/prod/...,"Coastal, Traditional"
6,https://s3.amazonaws.com/havenly-uploads/prod/...,"Farmhouse, Transitional"
7,https://s3.amazonaws.com/havenly-uploads/prod/...,Glam
8,https://s3.amazonaws.com/havenly-uploads/prod/...,Glam
9,https://s3.amazonaws.com/havenly-uploads/prod/...,"Classic, Eclectic"


The number of rows dropped after the cleaning

In [17]:
len(data),len(data_drop)

(24466, 20850)

The number of unique label combinations dropped significantly

In [18]:
data['labels'].nunique(),data_drop['labels'].nunique()

(9570, 738)
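
Note that `nunique()` counts distinct label combinations, not distinct styles. The sketch below, on made-up toy labels in the same comma-separated format, shows one way to see both the number of combinations and how often each single style occurs. The names `toy_labels`, `n_combinations` and `per_label` are illustrative.

```python
import pandas as pd

# Toy labels in the same comma-separated format as the cleaned 'labels' column
toy_labels = pd.Series(['Classic, Traditional',
                        'Modern, Bohemian, Glam',
                        'Glam',
                        'Glam',
                        'Classic, Traditional'])

# Distinct label combinations (what nunique() reports)
n_combinations = toy_labels.nunique()

# Split each combination into single styles, one row per style, then count them
per_label = toy_labels.str.split(', ').explode().value_counts()

print(n_combinations)
print(per_label['Glam'])
```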

## Replace links with serial image numbers and download the images

In [0]:
# Create a directory to hold the downloaded images
os.mkdir('Data/train_images1')

Download the images from the links in the 'links' column and save them in the data folder

In [20]:
# install wget
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [0]:
# Define the downloader function
import wget
from tqdm import tqdm

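def download_images(start=15198):
  # The default start of 15198 lies just past rows 15196 and 15197, which are dropped from the dataset below
  for i in tqdm(range(start, len(data_drop))):
    url = data_drop['links'][i]
    wget.download(url, 'Data/train_images1/img'+str(i)+'.jpg')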

In [0]:
# Call the downloader function
download_images()

While downloading the images, a few links still returned HTTP errors, so we remove those specific rows from our dataset.
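
One way to avoid restarting the download by hand after each failure is to catch the error and record the failing index. The following is a sketch rather than the code used for this dataset: it swaps in the standard library's `urllib.request.urlretrieve` for `wget`, and the `links`, `out_dir`, `start` and `fetch` parameters are assumptions added for illustration.

```python
import os
import urllib.error
import urllib.request

def download_images(links, out_dir, start=0, fetch=urllib.request.urlretrieve):
    """Download links[start:] into out_dir as img<i>.jpg and return the indices that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for i in range(start, len(links)):
        target = os.path.join(out_dir, 'img' + str(i) + '.jpg')
        try:
            fetch(links[i], target)
        except urllib.error.URLError:
            # HTTPError is a subclass of URLError, so HTTP errors are caught here too
            failed.append(i)
    return failed
```

Called as `failed = download_images(list(data_drop['links']), 'Data/train_images1')`, the returned list could then be passed to `drop` instead of hard-coding the failing row numbers.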

In [29]:
image_name = ['img'+str(i)+'.jpg' for i in range(len(data_drop))]
final_data = pd.DataFrame(columns=['images', 'class'])
final_data['images'] = image_name
final_data['class'] = data_drop['labels']
final_data.head(50)

Unnamed: 0,images,class
0,img0.jpg,"Classic, Traditional"
1,img1.jpg,"Modern, Bohemian, Glam"
2,img2.jpg,"Modern, Bohemian, Glam"
3,img3.jpg,"Farmhouse, Transitional"
4,img4.jpg,Transitional
5,img5.jpg,"Coastal, Traditional"
6,img6.jpg,"Farmhouse, Transitional"
7,img7.jpg,Glam
8,img8.jpg,Glam
9,img9.jpg,"Classic, Eclectic"


In [0]:
# Rows whose links returned HTTP errors during the download
indices = [15196, 15197]
final_data = final_data.drop(indices, axis=0)
final_data = final_data.reset_index(drop=True)
final_data.head(30)

We still have images with no labels. We select those images and remove them from both the dataset and the drive.

In [0]:
empty_labels=[]
indices=[]
for i in range(len(final_data)):
  if final_data['class'][i]=='':
    empty_labels.append(final_data['images'][i])
    indices.append(i)

In [97]:
len(empty_labels)

751

751 of them need to be removed.

In [0]:
for img in empty_labels:
  image = 'Data/train_images1/'+img  # the folder the images were downloaded into
  os.remove(image)

In [0]:
final_data = final_data.drop(indices, axis=0)

In [0]:
final_data = final_data.reset_index(drop=True)
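
The removal of unlabelled images can also be sketched with a boolean mask, deleting the files and the rows from the same selection. The example below runs on a temporary folder with made-up placeholder files. The names `image_dir`, `toy` and `is_empty` are illustrative, and the existence check is an added safeguard, not part of the steps above.

```python
import os
import tempfile
import pandas as pd

# Toy setup: three placeholder image files and a dataframe with one empty label
image_dir = tempfile.mkdtemp()
toy = pd.DataFrame({'images': ['img0.jpg', 'img1.jpg', 'img2.jpg'],
                    'class': ['Glam', '', 'Modern, Glam']})
for name in toy['images']:
    open(os.path.join(image_dir, name), 'w').close()

# Select the rows whose label string is empty
is_empty = toy['class'] == ''

# Delete the matching files, tolerating images that were never downloaded
for name in toy.loc[is_empty, 'images']:
    path = os.path.join(image_dir, name)
    if os.path.exists(path):
        os.remove(path)

# Drop the same rows from the dataframe and renumber the index
toy = toy[~is_empty].reset_index(drop=True)
print(len(toy), sorted(os.listdir(image_dir)))
```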

In [121]:
len(final_data)

20097

Final number of images!!

This gives us a final 'ready to work with' dataframe and dataset.

## Export the dataframe to drive

In [123]:
final_data.head(50)

Unnamed: 0,images,class
0,img0.jpg,"Classic, Traditional"
1,img1.jpg,"Modern, Bohemian, Glam"
2,img2.jpg,"Modern, Bohemian, Glam"
3,img3.jpg,"Farmhouse, Transitional"
4,img4.jpg,Transitional
5,img5.jpg,"Coastal, Traditional"
6,img6.jpg,"Farmhouse, Transitional"
7,img7.jpg,Glam
8,img8.jpg,Glam
9,img9.jpg,"Classic, Eclectic"


In [0]:
final_data.to_csv('drive/My Drive/fellowship/final_data.csv', index=False)

## Export the data folder to drive

In [128]:
!pip install PyDrive

Collecting PyDrive
  Downloading https://files.pythonhosted.org/packages/52/e0/0e64788e5dd58ce2d6934549676243dc69d982f198524be9b99e9c2a4fd5/PyDrive-1.3.1.tar.gz (987kB)
Building wheels for collected packages: PyDrive
  Building wheel for PyDrive (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/fa/d2/9a/d3b6b506c2da98289e5d417215ce34b696db856643bad779f4
Successfully built PyDrive
Installing collected packages: PyDrive
Successfully installed PyDrive-1.3.1


In [0]:
#Using PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# Copy the images into the folder that will be packed into a tar file
import os
import shutil
src = 'Data/train_images1'
dest = 'Data/train_images'
os.makedirs(dest, exist_ok=True)  # shutil.copy needs the destination folder to exist
src_files = os.listdir(src)
for file_name in src_files:
    full_file_name = os.path.join(src, file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, dest)

In [132]:
len(os.listdir(dest))

20097

The image folder matches the final number of images!!

In [0]:
# create a tarfile
import os
import tarfile

def make_tarfile(output_filename, source_dir):
  # "w:gz" writes a gzip-compressed archive, whatever extension output_filename has
  with tarfile.open(output_filename, "w:gz") as tar:
    tar.add(source_dir, arcname=os.path.basename(source_dir))

In [0]:
make_tarfile('interior_data.tar', dest)
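
Because the archive is opened with "w:gz", the resulting file is gzip-compressed even though its name ends in `.tar`. The sketch below, on a temporary folder with placeholder files, writes such an archive under a `.tar.gz` name (an illustrative choice, not the name used here) and reads it back to check its contents.

```python
import os
import tarfile
import tempfile

# Toy folder with two placeholder files, standing in for the image folder
work_dir = tempfile.mkdtemp()
source_dir = os.path.join(work_dir, 'train_images')
os.makedirs(source_dir)
for name in ['img0.jpg', 'img1.jpg']:
    open(os.path.join(source_dir, name), 'w').close()

# Write a gzip-compressed archive whose top-level folder is 'train_images'
archive_path = os.path.join(work_dir, 'interior_data.tar.gz')
with tarfile.open(archive_path, 'w:gz') as tar:
    tar.add(source_dir, arcname=os.path.basename(source_dir))

# 'r:*' lets tarfile detect the compression when reading
with tarfile.open(archive_path, 'r:*') as tar:
    members = sorted(tar.getnames())
print(members)
```

Listing the members before uploading is a cheap way to confirm that the archive holds the expected number of images.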

In [0]:
# Create GoogleDrive instance with authenticated GoogleAuth instance.
drive = GoogleDrive(gauth)

def upload_to_drive(file, title):
  uploaded = drive.CreateFile({'title': title})
  uploaded.SetContentFile(file)
  uploaded.Upload()
  print('Uploaded file %s with ID %s'%(file, uploaded.get('id')))

In [137]:
upload_to_drive("interior_data.tar","interior_data.tar")

Uploaded file interior_data.tar with ID 1ntCQsAtqmvQz905uvnmLBEqyQoM6H_hT
