<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-necessary-modules" data-toc-modified-id="Import-necessary-modules-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import necessary modules</a></span></li><li><span><a href="#Import-the-dataset" data-toc-modified-id="Import-the-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import the dataset</a></span></li><li><span><a href="#Some-checks-and-processing-(Rule-Based)" data-toc-modified-id="Some-checks-and-processing-(Rule-Based)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Some checks and processing (Rule Based)</a></span><ul class="toc-item"><li><span><a href="#Rules" data-toc-modified-id="Rules-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Rules</a></span><ul class="toc-item"><li><span><a href="#Remove-all-the-rows-where-the-product-name-was-missing---there-were-2-such-rows" data-toc-modified-id="Remove-all-the-rows-where-the-product-name-was-missing---there-were-2-such-rows-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Remove all the rows where the product name was missing - there were 2 such rows</a></span></li><li><span><a href="#Using-fuzzy-matching-to-find-the-most-resembling-brand-name-from-the-product-name-categories" data-toc-modified-id="Using-fuzzy-matching-to-find-the-most-resembling-brand-name-from-the-product-name-categories-3.1.2"><span class="toc-item-num">3.1.2&nbsp;&nbsp;</span>Using fuzzy matching to find the most resembling brand name from the product name categories</a></span></li></ul></li></ul></li><li><span><a href="#Basic-Analysis-Based-on-Ratings-of-different-brands" data-toc-modified-id="Basic-Analysis-Based-on-Ratings-of-different-brands-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Basic Analysis Based on Ratings of different brands</a></span></li></ul></div>

### Import necessary modules

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from fuzzywuzzy import fuzz

%matplotlib inline



### Import the dataset

In [2]:
df = pd.read_csv("input/raw_data.csv")

### Some checks and processing (Rule Based)

#### Rules
1. Remove all reviews that do not have a product associated with them.
2. For rows where the brand is missing, impute the brand name from the generic version of the product name.
3. Check how many brands have fewer than 3 reviews; these brands will be removed, because so few reviews would not matter for the analysis.
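
Rule 3 could be sketched roughly as follows. This is only an illustration on a made-up toy frame: the `reviews` DataFrame, its rows, and the `MIN_REVIEWS` name are assumptions, and only the `brand` and `rating` column names mirror the dataset used here.

```python
import pandas as pd

# Toy stand-in for the review data (hypothetical rows, not taken from raw_data.csv)
reviews = pd.DataFrame({
    "brand": ["Stone", "Stone", "Stone", "Corona", "Corona", "Becks"],
    "rating": [4.0, 2.0, 3.0, 5.0, 4.0, 2.0],
})

MIN_REVIEWS = 3  # threshold from rule 3 (assumed name)

# Number of reviews per brand, mapped back onto every row
reviews_per_brand = reviews["brand"].map(reviews["brand"].value_counts())

# Keep only rows whose brand has at least MIN_REVIEWS reviews
filtered = reviews[reviews_per_brand >= MIN_REVIEWS]

print(filtered["brand"].unique().tolist())
```

Mapping the counts back onto the rows keeps the filter a plain boolean mask, so the same pattern should carry over to `df2` once the missing brands have been imputed.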


In [3]:
df2 = df.copy(deep=True)

In [4]:
df2['product'].isnull().sum()

2

##### Remove all the rows where the product name was missing - there were 2 such rows

In [5]:
df2 = df2.dropna(axis=0, subset=['product'])
assert df2['product'].isnull().sum() == 0

In [35]:
original_brand_names = df['brand'].dropna().unique().tolist()

##### Using fuzzy matching to find the most resembling brand name from the product name categories

In [64]:
# First, extract the most relevant keyword from the product slug,
# which is generally its first hyphen-separated token
s = 'mikes-hard-blackcherry-235oz-can'
checking_str = s.split('-')[0].lower()
print(checking_str)

# Then score every existing brand name that contains this keyword against the full slug
probable_matches_dict = {}
for brand_name in original_brand_names:
    if checking_str in brand_name.lower():
        probable_matches_dict[brand_name] = fuzz.partial_token_sort_ratio(brand_name, s)

probable_matches_dict

mikes


{}

In [51]:
most_probable_match = max(probable_matches_dict, key=probable_matches_dict.get)
most_probable_match

'Stone'

In [61]:
def find_probable_brand(s):
    """Return the existing brand name that best matches the product slug `s`.

    Raises ValueError when no brand name contains the first token of `s`,
    because max() is then called on an empty dict.
    """
    # First, extract the most relevant keyword from the product slug,
    # which is generally its first hyphen-separated token
    checking_str = s.split('-')[0].lower()

    # Score every existing brand name that contains the keyword against the full slug
    probable_matches_dict = {}
    for brand_name in original_brand_names:
        if checking_str in brand_name.lower():
            probable_matches_dict[brand_name] = fuzz.partial_token_sort_ratio(brand_name, s)

    print(probable_matches_dict)
    # Return the brand name with the highest fuzzy score
    most_probable_match = max(probable_matches_dict, key=probable_matches_dict.get)
    return most_probable_match

In [63]:
ans = find_probable_brand(s='mikes-hard-blackcherry-235oz-can')
ans

{}


ValueError: max() arg is an empty sequence
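
The error above occurs because no existing brand name contains `mikes`, so `max()` receives an empty dict. One way to guard against this is sketched below. It is a hedged illustration, not the notebook's method: `find_probable_brand_safe` and `known_brands` are hypothetical names, and the standard-library `difflib.SequenceMatcher` is swapped in for `fuzz.partial_token_sort_ratio` so the sketch runs without `fuzzywuzzy`, which means the scores will differ from the fuzzy scores used here.

```python
from difflib import SequenceMatcher


def find_probable_brand_safe(product_slug, brand_names):
    """Return the best-matching brand name, or None when there is no candidate."""
    first_token = product_slug.split("-")[0].lower()
    readable_name = product_slug.replace("-", " ").lower()

    # Score only the brands that contain the slug's first token.
    # SequenceMatcher is a stand-in for the fuzzy score (assumption).
    scores = {
        brand: SequenceMatcher(None, brand.lower(), readable_name).ratio()
        for brand in brand_names
        if first_token in brand.lower()
    }

    # default=None avoids the ValueError that max() raises on an empty dict
    return max(scores, key=scores.get, default=None)


# Hypothetical brand list, for illustration only
known_brands = ["Stone Brewing", "Stone & Wood", "Budweiser"]

print(find_probable_brand_safe("stone-delicious-ipa-62", known_brands))
print(find_probable_brand_safe("mikes-hard-blackcherry-235oz-can", known_brands))
```

Returning `None` lets the caller decide on a fallback, such as title-casing the first token of the product slug.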

In [18]:
# Fallback: where the brand is missing, use the title-cased first token of the product slug
df2['brand'] = df2['brand'].fillna(df2['product'].str.split('-').str[0].str.title())

In [22]:
df[df['brand'].isnull()]

Unnamed: 0,id,content,date,product,brand,rating
13,3269,I love the flavor of this drink ! We usually e...,2021-12-06 03:46:00,mikes-hard-blackcherry-235oz-can,,4.0
19,3270,I normally love IPA but this beer was a tad to...,2021-11-29 06:11:00,stone-delicious-ipa-62,,4.0
21,3276,I am like many other beer enthusiasts and I li...,2021-11-17 04:10:00,stone-delicious-ipa-62,,2.0
25,2661,This is one of the cleanest tasting domestic b...,2021-11-13 03:46:00,michelob-ultra-pure-gold-superior-light-beer-1,,5.0
28,3267,My all - time favorite is corona light . This ...,2021-11-09 04:17:00,stone-delicious-ipa-62,,3.0
...,...,...,...,...,...,...
5813,6289,Bottle . Pours dark garnet - colored with a sm...,2012-04-16,wild-black,,3.0
5840,2556,"Appearance: Pours out a clear , yellow body wi...",2012-03-26 00:00:00,bud-light-platinum,,2.0
6072,6283,Bottle . Pours a deep ruby color with a thin t...,2011-11-20,beck-s-dark,,2.0
6487,2561,"Reviewed from notes , although its hard to for...",2011-02-16 00:00:00,budweiser-select,,0.0


### Basic Analysis Based on Ratings of different brands