## Flipkart_Amazon_Matching Problem

For this problem, I will be approaching it using NLP (Natural language processing). While downloading the pictures of each product and clustering similar pictures would work and likely be more accurate, it is more computationally expensive. This is why I will be opting out for a more strightforward and simple process. 

In [29]:
#Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer as tf_idf
import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import SequenceMatcher

## Cleaning

In [2]:
#loading necessary dataframes
amazon = pd.read_csv('amz_com-ecommerce_sample.csv')
flipkart = pd.read_csv('flipkart_com-ecommerce_sample.csv')

In [3]:
amazon.columns

Index(['uniq_id', 'crawl_timestamp', 'product_url', 'product_name',
       'product_category_tree', 'pid', 'retail_price', 'discounted_price',
       'image', 'is_FK_Advantage_product', 'description', 'product_rating',
       'overall_rating', 'brand', 'product_specifications'],
      dtype='object')

In [4]:
flipkart.columns #both of their columns have identical columns

Index(['uniq_id', 'crawl_timestamp', 'product_url', 'product_name',
       'product_category_tree', 'pid', 'retail_price', 'discounted_price',
       'image', 'is_FK_Advantage_product', 'description', 'product_rating',
       'overall_rating', 'brand', 'product_specifications'],
      dtype='object')

In [5]:
#introducing a new column which shows if an item purchase was from amazon or flipkart
amazon['flipkart or amazon']='amazon'
flipkart['flipkart or amazon']='flipkart'
dfs=[flipkart,amazon]
df=pd.concat(dfs)

In [6]:
#dropping unnecessary columns
df = df[['product_name','description','product_specifications','retail_price','discounted_price','flipkart or amazon']]

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 19999
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   product_name            40000 non-null  object 
 1   description             39996 non-null  object 
 2   product_specifications  39972 non-null  object 
 3   retail_price            39922 non-null  float64
 4   discounted_price        39922 non-null  float64
 5   flipkart or amazon      40000 non-null  object 
dtypes: float64(2), object(4)
memory usage: 2.1+ MB


In [8]:
df.dropna(inplace=True) #dropping missing values

In [9]:
df.drop_duplicates(inplace=True) #dropping duplicated values

In [10]:
df.shape

(39538, 6)

In [11]:
df.tail()

Unnamed: 0,product_name,description,product_specifications,retail_price,discounted_price,flipkart or amazon
19995,WALLDESIGN SMALL VINYL STICKER,Buy WallDesign Small Vinyl Sticker for Rs.730 ...,"{""product_specification""=>[{""key""=>""Number of ...",1498.0,876.0,amazon
19996,WALLMANTRA LARGE VINYL STICKERS STICKER,Buy Wallmantra Large Vinyl Stickers Sticker fo...,"{""product_specification""=>[{""key""=>""Number of ...",1415.0,1424.0,amazon
19997,ELITE COLLECTION MEDIUM ACRYLIC STICKER,Buy Elite Collection Medium Acrylic Sticker fo...,"{""product_specification""=>[{""key""=>""Number of ...",1284.0,1196.0,amazon
19998,ELITE COLLECTION MEDIUM ACRYLIC STICKER,Buy Elite Collection Medium Acrylic Sticker fo...,"{""product_specification""=>[{""key""=>""Number of ...",1492.0,1364.0,amazon
19999,ELITE COLLECTION MEDIUM ACRYLIC STICKER,Buy Elite Collection Medium Acrylic Sticker fo...,"{""product_specification""=>[{""key""=>""Number of ...",1484.0,1247.0,amazon


## Preprocessing

In [12]:
#converting the two neccessary columns to list
df['product_name']=df['product_name'].apply(lambda x:x.split(' '))
df['description']=df['description'].apply(lambda x:x.split(' '))

In [13]:
#removing spaces
df['description']=df['description'].apply(lambda x: [i.replace(" ","") for i in x]) 
df['product_name']=df['product_name'].apply(lambda x: [i.replace(" ","") for i in x]) 

In [14]:
#creating new column which consists of product's desciption and product's name
df['Total']=df['description'] + df['product_name']

In [15]:
#converting lists back to string
df['product_name']=df['product_name'].apply(lambda x: ' '.join(x))
df['Total']=df['Total'].apply(lambda x: ' '.join(x))

In [16]:
df=df.reset_index()

In [17]:
df=df.drop(['index'],axis=1)

In [18]:
def get_wordnet_pos(word):
    """
    This funtion uses the pos_tag to return the part of speech of words in sentences.
    
    word: string
    The word to check for the part of speech
    
    return
    -------
    returns the part of speech of word. 
    """
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [19]:
def lemmatizer_(word):
    """
    This funtion uses the lemmatizes words while considering their part of speech
    
    word: string
    The word to lemmatize
    
    return
    -------
    returns the lemmatized version of the word
    """
    lemmatizer=WordNetLemmatizer()
    return lemmatizer.lemmatize(word,get_wordnet_pos(word))

In [20]:
df['Total']=df['Total'].apply(lemmatizer_) 

In [21]:
#A tf_idf vectorizer is used to convert words to vectors based on their frequency and importance.
#Fit transform will basically fit the words into what we've set for tfidf and the toarray returns an array of vectors

vectorizer=tf_idf(max_features=3000,stop_words='english',lowercase=True) 

In [22]:
vectors=vectorizer.fit_transform(df['Total']).toarray() 

In [23]:
vectors.shape

(39538, 3000)

The cosine similarity will measure the similarity between each feature words and product title. Indices of Products with highest similarity will be returned.

In [24]:
similarity=cosine_similarity(vectors) 

In [25]:
#for example the similarity of product with index three sorted in decending order
sorted(similarity[3],reverse=True)

[1.0000000000000002,
 1.0000000000000002,
 0.9960595965945925,
 0.9960595965945925,
 0.9934102286212935,
 0.9934102286212935,
 0.9889875560043824,
 0.9889875560043824,
 0.988012796234811,
 0.988012796234811,
 0.9763715587826516,
 0.9763715587826516,
 0.7303706374798029,
 0.7303706374798029,
 0.7303706374798029,
 0.7303706374798029,
 0.7294728410268164,
 0.7294728410268164,
 0.6886734506736,
 0.6886734506736,
 0.6886734506736,
 0.6886734506736,
 0.6886734506736,
 0.6886734506736,
 0.6886734506736,
 0.6886734506736,
 0.6883786018217896,
 0.6883786018217896,
 0.6883786018217896,
 0.6883786018217896,
 0.6882892177788025,
 0.6882892177788025,
 0.6882870603352397,
 0.6882870603352397,
 0.6877847878380943,
 0.6877847878380943,
 0.687650545551997,
 0.687650545551997,
 0.6875614447807052,
 0.6875614447807052,
 0.6875592941735037,
 0.6875592941735037,
 0.6871475174838687,
 0.6871475174838687,
 0.6868546234004307,
 0.6868546234004307,
 0.6359360925395592,
 0.6359360925395592,
 0.5701073987763596,

In [26]:
def search(word,product_name):
    """
    This funtion checks for the similarity between two words and return the word if the similarity is greater than 0.8
    word: A string
    word 'A' to be compared
    
    product_name: string
    word 'B' to compare word 'A' with
    
    return
    -------
    A word 'A' if word 'A' and 'B' ae 0.8 similar
    """
    if SequenceMatcher(a = word.lower(),b = product_name.lower()).ratio() > 0.8:
        return word

In [27]:
def match(Product_name):
    """
    This funtion matches a name with another in the dataframe that has the highest similarity to it. 
    Product_name: string
    Name of product to be matched
    
    return
    -------
    A dataframe showing the the prices in both amazon and flipkart and the names in both places.
    """
    Product_name = Product_name
    df_= pd.DataFrame() # a temp dataframe that holds result of the search function
    df_['product_name'] = df['product_name'].apply(search,product_name=Product_name) #applying search funtion
    df_['flipkart or amazon']=df['flipkart or amazon']
    df_.dropna(axis=0,inplace=True)
    df_new=pd.DataFrame()
    try: #incase the product does not exist in flipkart
        if Product_name in list(df_['product_name'].where(df_['flipkart or amazon']=='flipkart').dropna()):
            product_index_flipkart=df[df['product_name'] == Product_name].where(df['flipkart or amazon']=='flipkart').dropna().index[0]
        else:
            Product_name = list(df_['product_name'].where(df['flipkart or amazon']=='flipkart').dropna())[0]
            product_index_flipkart=df[df['product_name'] == Product_name].where(df['flipkart or amazon']=='flipkart').dropna().index[0]
    except:
        df_new['Product name on Flipkart'] = ['Product Not in flipkart']
        df_new['Retail Price in Flipkart'] = ['Product Not in flipkart']
        df_new['Discounted Price in Flipkart'] = ['Product Not in flipkart']
        try: #in case the product doesn't exist in amazon
            if Product_name in list(df_['product_name'].where(df_['flipkart or amazon']=='amazon').dropna()):
                product_index_amazon=df[df['product_name'] == Product_name].where(df['flipkart or amazon']=='amazon').dropna().index[0]
            else:
                Product_name = list(df_['product_name'].where(df['flipkart or amazon']=='amazon').dropna())[0]
                product_index_amazon=df[df['product_name'] == Product_name].where(df['flipkart or amazon']=='amazon').dropna().index[0]
        except:
            df_new['Product name on Amazon'] = ['Product Not in amazon']
            df_new['Retail Price in Amazon'] = ['Product Not in amazon']
            df_new['Discounted Price in Amazon'] = ['Product Not in amazon']
        else:
            distances_amazon =similarity[product_index_amazon]
            amazon_product=sorted(list(enumerate(distances_amazon)),reverse=True,key=lambda x:x[1])[0]
            df_new['Product name on Amazon'] = [df.iloc[amazon_product[0]].product_name]
            df_new['Retail Price in Amazon'] = [df.iloc[amazon_product[0]].retail_price]
            df_new['Discounted Price in Amazon'] = [df.iloc[amazon_product[0]].discounted_price]
    else:
        try: #in case the product doesn't exist in amazon
            if Product_name in list(df_['product_name'].where(df_['flipkart or amazon']=='amazon').dropna()):
                product_index_amazon=df[df['product_name'] == Product_name].where(df['flipkart or amazon']=='amazon').dropna().index[0]
            else:
                Product_name = list(df_['product_name'].where(df['flipkart or amazon']=='amazon').dropna())[0]
                product_index_amazon=df[df['product_name'] == Product_name].where(df['flipkart or amazon']=='amazon').dropna().index[0]
        except:
            product_index_flipkart=df[df['product_name']== Product_name].where(df['flipkart or amazon']=='flipkart').dropna().index[0]
            df_new['Product name on Amazon'] = ['Product Not in amazon']
            df_new['Retail Price in Amazon'] = ['Product Not in amazon']
            df_new['Discounted Price in Amazon'] = ['Product Not in amazon']
            distances_flipkart=similarity[product_index_flipkart]
            flipkart_product=sorted(list(enumerate(distances_flipkart)),reverse=True,key=lambda x:x[1])[0]
            df_new['Product name on Flipkart'] = [df.iloc[flipkart_product[0]].product_name]
            df_new['Retail Price in Flipkart'] = [df.iloc[flipkart_product[0]].retail_price]
            df_new['Discounted Price in Flipkart'] = [df.iloc[flipkart_product[0]].discounted_price]
        else: 
            distances_amazon =similarity[product_index_amazon] 
            distances_flipkart=similarity[product_index_flipkart]
            flipkart_product=sorted(list(enumerate(distances_flipkart)),reverse=True,key=lambda x:x[1])[0]
            amazon_product=sorted(list(enumerate(distances_amazon)),reverse=True,key=lambda x:x[1])[0]
            df_new['Product name on Flipkart'] = [df.iloc[flipkart_product[0]].product_name]
            df_new['Retail Price in Flipkart'] = [df.iloc[flipkart_product[0]].retail_price]
            df_new['Discounted Price in Flipkart'] = [df.iloc[flipkart_product[0]].discounted_price]
            df_new['Product name on Amazon'] = [df.iloc[amazon_product[0]].product_name]
            df_new['Retail Price in Amazon'] = [df.iloc[amazon_product[0]].retail_price]
            df_new['Discounted Price in Amazon'] = [df.iloc[amazon_product[0]].discounted_price]    
    return df_new

In [30]:
match("FDT Women's Leggings")

Unnamed: 0,Product name on Flipkart,Retail Price in Flipkart,Discounted Price in Flipkart,Product name on Amazon,Retail Price in Amazon,Discounted Price in Amazon
0,FDT Women's Leggings,699.0,309.0,FDT WOMEN'S Leggings Pants,698.0,362.0


## In conclusion

The solution to this problem is probably not the most logical but given the number of dataset, training a model with pictures would probably take more time so I recommend this text processing approach. 