Download the dataset from the following link:

https://drive.google.com/file/d/1SRutIp484SkfeJFcRkBjHClpylScQn4y/view?usp=sharing

Load the dataset with pandas:

In [3]:
import pandas as pd

In [4]:
data = pd.read_csv("Sharktankpitchesdeals.csv")

The dataset describes 706 business ideas pitched on Shark Tank, a very popular TV show in the US, with one feature per column. The column named "Pitched_Business_Desc", which is short for Pitched Business Description, looks like this:

In [4]:
pitched_business_desc = data['Pitched_Business_Desc']

In [5]:
pitched_business_desc

0      a functional slip worn under a wedding gown th...
1      hair-care products made with pheromones . Laid...
2      a notebook that can scan contents to cloud ser...
3      painting classes with wine served . Wine & Des...
4      a mixing bowl with a built-in scoop . Peoples ...
                             ...                        
701     (Emmy the Elephant during show, trademarked a...
702    a packing and organizing service based on an a...
703    an implantable Bluetooth device requiring surg...
704                                       a pie company 
705    an electronic hand-held device for waiting roo...
Name: Pitched_Business_Desc, Length: 706, dtype: object

Each entry in this column is a free-text description of one pitched business. Let's look at the pitched business description in the first row:

In [6]:
data['Pitched_Business_Desc'].iloc[0]

'a functional slip worn under a wedding gown that allows the wearer to use the restroom on their own . Bridal Buddy is a lightweight slip worn under the gown that lets brides go to the bathroom while wearing it. When nature calls, the bride can bag up her bustle to safely relieve herself without making a mess.'

Similarly, there are pitched business descriptions for all the other 705 business ideas. 

The task is to create a vocabulary of the words in the pitched business descriptions. More specifically, we have to create a list or array of all the unique words occurring across all of them. As an example, the vocabulary of the description above looks something like this:

In [7]:
first_string = data['Pitched_Business_Desc'].iloc[0]

In [8]:
first_string_vocab = list(set(first_string.split(" ")))

In [9]:
first_string_vocab

['a',
 'the',
 'calls,',
 'Buddy',
 'bride',
 'Bridal',
 'go',
 'When',
 'restroom',
 'her',
 'use',
 'relieve',
 'functional',
 '.',
 'bustle',
 'it.',
 'without',
 'their',
 'is',
 'brides',
 'mess.',
 'own',
 'lightweight',
 'wearing',
 'bag',
 'lets',
 'safely',
 'wearer',
 'herself',
 'up',
 'nature',
 'gown',
 'wedding',
 'that',
 'allows',
 'to',
 'can',
 'worn',
 'bathroom',
 'under',
 'slip',
 'on',
 'while',
 'making']

Looking at this vocabulary, we can see that no word is repeated, and that the description above is built entirely from the words in it.

But there is a problem. Look at some of the words in the vocabulary of this business description, for example:

"it.", "mess.", "Bridal"

What is wrong with these words? When the vocabulary of another pitched business description is created, the same words might appear again in a slightly different form, say:

"It.", "mess" and "bridal"

Compared as strings, "it." in the first case is different from "It." in the second because of the lower-case "i" versus the upper-case "I". Similarly, "mess." is different from "mess" because of the "." at the end. The same goes for the third word, where "Bridal" is different from "bridal" because of the upper-case "B" versus the lower-case "b".

Let's pick a word and work out in how many ways it can be written so that every version is treated as a different string even though, to a reader, they are all the same word. Take the word "bridal":

"bridal", "Bridal", "BRIDAL", "bridal.", "bridal's", ".Bridal"

All of the above are the same word to a reader, but they will be treated as different strings. So, how do we solve this problem? First, we handle the simple problem of upper and lower case, which is easily solved by converting every business description to lower case. After doing this, the above words look like this:

"bridal", "bridal", "bridal", "bridal.", "bridal's", ".bridal"

And when we pick out the unique words out of the above list of words, we are left with: 

"bridal", "bridal.", "bridal's", ".bridal"

In [64]:
pitched_business_desc = data['Pitched_Business_Desc'].apply(lambda x: x.lower())

So, we converted all the strings to lower case.

In [55]:
pitched_business_desc

0      a functional slip worn under a wedding gown th...
1      hair-care products made with pheromones . laid...
2      a notebook that can scan contents to cloud ser...
3      painting classes with wine served . wine & des...
4      a mixing bowl with a built-in scoop . peoples ...
                             ...                        
701     (emmy the elephant during show, trademarked a...
702    a packing and organizing service based on an a...
703    an implantable bluetooth device requiring surg...
704                                       a pie company 
705    an electronic hand-held device for waiting roo...
Name: Pitched_Business_Desc, Length: 706, dtype: object

So, the problem is solved to some extent. 
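
To make concrete what lowercasing fixes and what it leaves behind, here is a minimal sketch on the "bridal" variants listed earlier. The list `variants` is typed in by hand for illustration and is not read from the dataset.

```python
# Hand-written list of the variants discussed above (not taken from the dataset)
variants = ["bridal", "Bridal", "BRIDAL", "bridal.", "bridal's", ".Bridal"]

# Without any normalisation, every variant counts as a separate word
print(len(set(variants)))  # 6

# Lowercasing merges the variants that differ only by case
lowered = [word.lower() for word in variants]
unique_lowered = sorted(set(lowered))
print(unique_lowered)  # ['.bridal', 'bridal', "bridal's", 'bridal.']
```

Four distinct strings remain, all of which differ only by punctuation or an apostrophe suffix.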

Now, to remove special characters inside a word, like "'s", or at the start or end of a word, like ".", we use regular expressions, also known as regex for short.

So, what are regular expressions? For this, let's have a look at the following PDF snapshots:

With this knowledge of regex from the notes above, we can now convert each pitched business description directly into its filtered words.
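
As a rough preview, the sketch below applies one regular expression to the lowercased "bridal" variants. The pattern matches either an apostrophe followed by a single letter (such as "'s") or a run of ".", "?" or "!" with any trailing spaces. The names `pattern`, `variants` and `cleaned` are illustrative only.

```python
import re

# An apostrophe followed by one letter (e.g. "'s"),
# or a run of sentence punctuation (., ?, !) plus any trailing spaces
pattern = r"'[a-zA-Z]|[.?!]+ *"

# The four strings that were still distinct after lowercasing
variants = ["bridal", "bridal.", "bridal's", ".bridal"]

# Remove every match from each string
cleaned = [re.sub(pattern, "", word) for word in variants]
print(cleaned)       # ['bridal', 'bridal', 'bridal', 'bridal']
print(set(cleaned))  # {'bridal'}
```

All four variants collapse to the single word "bridal".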

In [37]:
# Read Regex from https://www.nltk.org/book/ch03.html

In [38]:
import re

Imported Regular Expressions Library

In [39]:

apostrophe_s_regex = r'''('[a-zA-Z])'''
dot_regex = r'([.?!]+) *'
apostrophe_s_dot_regex = r''''[a-zA-Z]|[.?!]+ *'''

In [40]:
#re.sub(regpattern, "", tempStr)
re.findall(r'''^(.*?)('[a-zA-Z]$)''', "bridal's")

re.findall(r'''('[a-zA-Z])''', "is bridal's buddy")

re.findall(r'''[.]''', "while wearing it. when")


apostrophe_s = r'''('[a-zA-Z])'''
dot_regex = r'([.?!]+) *'

re.findall(apostrophe_s, "is bridal's buddy it. when")

re.findall(dot_regex, "hello world! What? It's nice.")

re.sub(apostrophe_s,"","is bridal's buddy it. when")
re.sub(dot_regex," ","hello world! What? It's nice.")

re.findall(''''[a-zA-Z]|[.?!]+ *''', "hello world! What? It's nice.is bridal's buddy it. when")

re.sub(apostrophe_s_dot_regex," ", "hello world! What? It's nice.is bridal's buddy it. when")

list_with_space = re.sub(''''[a-zA-Z]|[.?!]+ *'''," ", "hello world! What? It's nice.is bridal's buddy it. when").split(" ")
list(filter(None, list_with_space))



[('bridal', "'s")]

["'s"]

['.']

["'s"]

['!', '?', '.']

'is bridal buddy it. when'

"hello world What It's nice "

['! ', '? ', "'s", '.', "'s", '. ']

'hello world What It  nice is bridal  buddy it when'

['hello', 'world', 'What', 'It', 'nice', 'is', 'bridal', 'buddy', 'it', 'when']

In [65]:
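# Strip "'s"-style suffixes and sentence punctuation from every description.
# Caveat: replacing with "" joins two words when the punctuation has no space before it
# (e.g. "it. when" becomes "itwhen"); replacing with " " would keep them apart.
pitched_business_desc = pitched_business_desc.apply(lambda x: re.sub(apostrophe_s_dot_regex, "", x))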
#pitched_business_desc = pitched_business_desc.apply(lambda x: re.sub(dot_regex, " ", x))


In [66]:
pitched_business_desc

0      a functional slip worn under a wedding gown th...
1      hair-care products made with pheromones laid b...
2      a notebook that can scan contents to cloud ser...
3      painting classes with wine served wine & desig...
4      a mixing bowl with a built-in scoop peoples de...
                             ...                        
701     (emmy the elephant during show, trademarked a...
702    a packing and organizing service based on an a...
703    an implantable bluetooth device requiring surg...
704                                       a pie company 
705    an electronic hand-held device for waiting roo...
Name: Pitched_Business_Desc, Length: 706, dtype: object

In [67]:
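# Split each description into a list of tokens
pitched_business_desc = pitched_business_desc.apply(lambda x: x.split(" "))
# More on filter: https://docs.python.org/3/library/functions.html#filter
# Note: filter(None, ...) here iterates over the descriptions, so it would only drop empty lists;
# empty-string tokens inside each list are kept
pitched_business_desc = list(filter(None, pitched_business_desc))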
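
A small sketch of how `filter(None, ...)` behaves on a single token list, assuming the goal is to drop the empty strings that `split(" ")` produces around repeated or trailing spaces. The string `sample` is made up for illustration.

```python
# Made-up sample with a double space and a trailing space
sample = "a pie  company "

# Splitting on a single space leaves empty strings behind
tokens = sample.split(" ")
print(tokens)  # ['a', 'pie', '', 'company', '']

# filter(None, ...) keeps only truthy items, so the empty strings are dropped
non_empty = list(filter(None, tokens))
print(non_empty)  # ['a', 'pie', 'company']

# split() with no argument splits on any run of whitespace and gives the same tokens
print(sample.split())  # ['a', 'pie', 'company']
```

To clean a whole column this way, the filter would have to be applied to each token list rather than to the collection of lists.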

In [44]:
#pitched_business_desc = pitched_business_desc.apply(lambda x: set(x))

In [34]:
#pitched_business_desc = list(pitched_business_desc)

In [68]:
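# Gather the unique tokens of every description (a list, despite the name)
resulting_set = []

for desc in pitched_business_desc:
    resulting_set += set(desc)

In [71]:
# Deduplicate across descriptions to get the final vocabulary
vocab = set(resulting_set)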

In [75]:
len(vocab)


5445
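
The steps above can be gathered into one function. This is a minimal sketch under stated assumptions, not the exact code that produced the count above: the name `build_vocab` and the sample strings are hypothetical, matches are replaced with a space instead of an empty string, and `split()` is used so that empty tokens are dropped. On the real dataset it may therefore give a count different from 5445.

```python
import re

# "'s"-style suffixes, or runs of . ? ! with any trailing spaces
TOKEN_NOISE = r"'[a-zA-Z]|[.?!]+ *"

def build_vocab(descriptions):
    """Return the set of unique lowercase words across all descriptions."""
    vocab = set()
    for text in descriptions:
        # Lowercase, then replace noise with a space so neighbouring words stay apart
        cleaned = re.sub(TOKEN_NOISE, " ", text.lower())
        # split() with no argument drops empty tokens
        vocab.update(cleaned.split())
    return vocab

# Made-up sample descriptions for illustration
sample = ["Bridal Buddy is a slip.", "The bride's slip . It is a slip!"]
vocab = build_vocab(sample)
print(sorted(vocab))
# ['a', 'bridal', 'bride', 'buddy', 'is', 'it', 'slip', 'the']
```

With a pandas Series, the same function could be called as `build_vocab(data['Pitched_Business_Desc'])`, since iterating over a Series yields its values.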