# User-Artists

<b>TASK:</b> Given User requirements (Artist Category, Type of Event, Budget, Location and Gathering Size) predict type of <i>Artists</i> that would be pitched to the Users to capture their requirements accurately. 

In this Notebook I have attempted to GET Similar Users based on his/her requirements.<br>
<i>Method Used:</i> TF-IDF (Term Frequency - Inverse Document Frequency) on Category, Event and Location of the Event

## Library

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel

### Dataset
The Dataset is provided by [StarClinch](https://starclinch.com).

Information about the given Dataset:

<i><b>asid:</b></i>ID of the ROW<br>                         
<i><b>created:</b></i>Date + Time of the creation.<br>               
                                                                             
<i><b>dealid:</b></i>(Unique) ID of the DEAL <br>                   
<i><b>dealowner:</b></i> Name of the Person the deal is assigned to <br>
                                                                             
<i><b>clientname:</b></i>Name of the User (client) <br>                 
<i><b>location:</b></i>Location of the EVENT <br>                     
<i><b>categoryname:</b></i>Artists Category. (14 Distinct Categories)<br> 
<i><b>eventname:</b></i>Type of Event. (14 Distinct Event Types)<br>   
<i><b>budget:</b></i>Budget for the Event given by User.<br>         
<i><b>gathering:</b></i>Gathering Size<br>                             
<i><b>date:</b></i>Event Date<br>                                 
                                                                             
<i><b>artistsurls:</b></i>URLS of the artists pitched to the User<br>    
<i><b>artists:</b></i>ID of the artists pitched<br>                  
                                                                             
<i><b>lookingfor</b></i> is mostly empty as it signifies whether the user have already specified which artists he/she |wants in the event.<br>
<i><b>city:</b></i>Value is extracted from location<br>

For List of Artist Categories and Type of Events visit [StarClinch](https://starclinch.com/requirement.html)

In [2]:
# Read the Dataset
df = pd.read_csv('artist-pitch.csv')
df.head()

Unnamed: 0,asid,created,dealid,dealowner,clientname,location,categoryname,eventname,artisturls,artists,budget,gathering,date,lookingfor,city
0,272,5/17/2018 22:02,3255,Shambhavi from StarClinch,Amit,"Nagpur, Maharashtra, India",SINGER,charity,"abhigyan-das, deepak-malik, mohit-gaur, pawand...","15,90,50,15,90,53,15,70,00,00,00,00,00,00,00,0...",200000,2000.0,16-06-2018,,
1,338,5/22/2018 16:43,4995,Shalini from StarClinch,Kavita,"Andheri East, Mumbai, Maharashtra, India",ANCHOR/EMCEE,corporate,"divya-malik, saddvi-bajaj, vandana-bisht, aish...","40,41,30,64,09,72,93,40,00,00,00,00,00,00,00,0...",85000,1000.0,16-11-2018,,Delhi
2,300,5/19/2018 20:03,5814,Shambhavi from StarClinch,Gautam Kapoor,Jim corbett,LIVE BAND,wedding,"anhad-the-band, rangreza-the-band, big-m, mera...",332925197241443000000000000,30000,160.0,08-02-2019,,Delhi
3,268,5/17/2018 17:26,5871,Prasanta From StarClinch,Rajat Aggarwal,"Beauty & Spa, praghati maidan",ANCHOR/EMCEE,exhibition,"anchor-ekta-sharma, anchor-shina-gujral, kanis...","2,43,65,02,72,02,40,56,00,00,00,00,00,00,00,00...",10000,2000.0,28-05-2018,https://starclinch.com/ilia,Delhi
4,320,5/21/2018 19:19,6395,Nitisha from StarClinch,Ajay Kumar dhiman,Chandigarh,PHOTO/VIDEOGRAPHER,wedding,"aj-frames, soulmate-weddings, vivah-moments, s...","2,39,37,22,63,29,34,12,00,00,00,00,00,00,00,00...",150000,250.0,10-11-2018,Harnav Bir Singh Photography,Chandigarh


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 15 columns):
asid            81 non-null int64
created         81 non-null object
dealid          81 non-null int64
dealowner       81 non-null object
clientname      81 non-null object
location        81 non-null object
categoryname    81 non-null object
eventname       81 non-null object
artisturls      81 non-null object
artists         81 non-null object
budget          81 non-null int64
gathering       80 non-null float64
date            81 non-null object
lookingfor      13 non-null object
city            48 non-null object
dtypes: float64(1), int64(3), object(11)
memory usage: 9.6+ KB


``` Gathering have 1 missing value ```

In [14]:
df[df['gathering'].isnull()]

Unnamed: 0,asid,created,dealid,dealowner,clientname,location,categoryname,eventname,artisturls,artists,budget,gathering,date,lookingfor,city
79,387,5/25/2018 17:50,6814,Shambhavi,Anshika Bhatelay,Agra,LIVE BAND,corporate,"delhicious-harmony, ahad-band, medlon, soulful...","2,40,55,32,33,91,62,43,00,00,00,00,00,00,00,00...",35000,,2018-06-28,Live Bands,


In [16]:
# Instead of leaving it as NaN set it to 0, although that's not possible as event cannot have "0" gathering
df['gathering'].fillna(0, inplace = True)

In [4]:
# Converting date to a DateTime Format
df['date'] = pd.to_datetime(df['date'])

In [5]:
# Checking who all pitch artist to the user.
df.dealowner.unique()

array(['Shambhavi from StarClinch', 'Shalini from StarClinch',
       'Prasanta From StarClinch', 'Nitisha from StarClinch',
       'Abhishek from StarClinch', 'Geetika from StarClinch'],
      dtype=object)

In [6]:
# Stripping From StarClinch from the names.
# Now, we only have the names.
df['dealowner'] = df['dealowner'].str.split(expand = True)[0]

In [9]:
# Sorting the Dataframe based on DealID
df.sort_values('dealid', ascending = True, inplace = True)

In [10]:
df.head()

Unnamed: 0,asid,created,dealid,dealowner,clientname,location,categoryname,eventname,artisturls,artists,budget,gathering,date,lookingfor,city
0,272,5/17/2018 22:02,3255,Shambhavi,Amit,"Nagpur, Maharashtra, India",SINGER,charity,"abhigyan-das, deepak-malik, mohit-gaur, pawand...","15,90,50,15,90,53,15,70,00,00,00,00,00,00,00,0...",200000,2000.0,2018-06-16,,
1,338,5/22/2018 16:43,4995,Shalini,Kavita,"Andheri East, Mumbai, Maharashtra, India",ANCHOR/EMCEE,corporate,"divya-malik, saddvi-bajaj, vandana-bisht, aish...","40,41,30,64,09,72,93,40,00,00,00,00,00,00,00,0...",85000,1000.0,2018-11-16,,Delhi
2,300,5/19/2018 20:03,5814,Shambhavi,Gautam Kapoor,Jim corbett,LIVE BAND,wedding,"anhad-the-band, rangreza-the-band, big-m, mera...",332925197241443000000000000,30000,160.0,2019-08-02,,Delhi
3,268,5/17/2018 17:26,5871,Prasanta,Rajat Aggarwal,"Beauty & Spa, praghati maidan",ANCHOR/EMCEE,exhibition,"anchor-ekta-sharma, anchor-shina-gujral, kanis...","2,43,65,02,72,02,40,56,00,00,00,00,00,00,00,00...",10000,2000.0,2018-05-28,https://starclinch.com/ilia,Delhi
4,320,5/21/2018 19:19,6395,Nitisha,Ajay Kumar dhiman,Chandigarh,PHOTO/VIDEOGRAPHER,wedding,"aj-frames, soulmate-weddings, vivah-moments, s...","2,39,37,22,63,29,34,12,00,00,00,00,00,00,00,00...",150000,250.0,2018-10-11,Harnav Bir Singh Photography,Chandigarh


## Splitting the DataSet
Split the dataset into PYR Information i.e Information that user fills in <i>Post Your Requirement</i>
    
    Name of the Client, Type of Artist, Event Type, Venue, Date, Gathering Size & Budget 
    
& Information of artist pitched to the user by the Deal Owner.

    Name of the Dealer, ID Artists Pitched & URL of Artists Pitched.
    
Both have <i>asid</i> & <i>dealid</i> as the <b>KEY</b> to cross-check b/w data.

In [17]:
# PYR and Deal Features
PYR_Feat = ['asid', 'dealid', 'clientname', 'categoryname', 'eventname', 'location', 'date', 'gathering', 'budget']
deal_Feat = ['asid', 'dealid', 'dealowner', 'artists', 'artisturls']

# Spliting the Dataframe into two parts
PYR = df[PYR_Feat].copy()
deal = df[deal_Feat].copy()

In [18]:
PYR.head()

Unnamed: 0,asid,dealid,clientname,categoryname,eventname,location,date,gathering,budget
0,272,3255,Amit,SINGER,charity,"Nagpur, Maharashtra, India",2018-06-16,2000.0,200000
1,338,4995,Kavita,ANCHOR/EMCEE,corporate,"Andheri East, Mumbai, Maharashtra, India",2018-11-16,1000.0,85000
2,300,5814,Gautam Kapoor,LIVE BAND,wedding,Jim corbett,2019-08-02,160.0,30000
3,268,5871,Rajat Aggarwal,ANCHOR/EMCEE,exhibition,"Beauty & Spa, praghati maidan",2018-05-28,2000.0,10000
4,320,6395,Ajay Kumar dhiman,PHOTO/VIDEOGRAPHER,wedding,Chandigarh,2018-10-11,250.0,150000


In [19]:
deal.head()

Unnamed: 0,asid,dealid,dealowner,artists,artisturls
0,272,3255,Shambhavi,"15,90,50,15,90,53,15,70,00,00,00,00,00,00,00,0...","abhigyan-das, deepak-malik, mohit-gaur, pawand..."
1,338,4995,Shalini,"40,41,30,64,09,72,93,40,00,00,00,00,00,00,00,0...","divya-malik, saddvi-bajaj, vandana-bisht, aish..."
2,300,5814,Shambhavi,332925197241443000000000000,"anhad-the-band, rangreza-the-band, big-m, mera..."
3,268,5871,Prasanta,"2,43,65,02,72,02,40,56,00,00,00,00,00,00,00,00...","anchor-ekta-sharma, anchor-shina-gujral, kanis..."
4,320,6395,Nitisha,"2,39,37,22,63,29,34,12,00,00,00,00,00,00,00,00...","aj-frames, soulmate-weddings, vivah-moments, s..."


## Building User Similarity Metrics

In [20]:
# Clean the Dataset
def clean_dataset(x):
    if isinstance(x, str):
        return str.lower(x.replace(" ", "").replace("&", ",").replace("/",","))
    else:
        return ''

In [22]:
# Clean the dataframe using clean_dataset() function
feat = ['categoryname', 'eventname', 'location']
for f in feat:
    PYR[f] = PYR[f].apply(clean_dataset)

In [21]:
# Create a Soup i.e 
## A new Column that combines Category, EventName and Location
## for applying TF-IDF
def create_soup(x):
    return ''.join(x['categoryname']) + ', ' + ''.join(x['eventname']) + ', ' + ''.join(x['location'])

In [23]:
# Create SOUP
PYR['soup'] = PYR.apply(create_soup, axis = 1)

In [24]:
PYR.head()

Unnamed: 0,asid,dealid,clientname,categoryname,eventname,location,date,gathering,budget,soup
0,272,3255,Amit,singer,charity,"nagpur,maharashtra,india",2018-06-16,2000.0,200000,"singer, charity, nagpur,maharashtra,india"
1,338,4995,Kavita,"anchor,emcee",corporate,"andherieast,mumbai,maharashtra,india",2018-11-16,1000.0,85000,"anchor,emcee, corporate, andherieast,mumbai,ma..."
2,300,5814,Gautam Kapoor,liveband,wedding,jimcorbett,2019-08-02,160.0,30000,"liveband, wedding, jimcorbett"
3,268,5871,Rajat Aggarwal,"anchor,emcee",exhibition,"beauty,spa,praghatimaidan",2018-05-28,2000.0,10000,"anchor,emcee, exhibition, beauty,spa,praghatim..."
4,320,6395,Ajay Kumar dhiman,"photo,videographer",wedding,chandigarh,2018-10-11,250.0,150000,"photo,videographer, wedding, chandigarh"


### TF-IDF
Term Frequency - Inverse Document Frequency

In [25]:
# Setup tfidfvectorizer with english stop words
tfidf_vect = TfidfVectorizer(stop_words="english")

# Fit the SOUP data into TfidfVectorizer to get a tfidf matrix
tfidf_mat = tfidf_vect.fit_transform(PYR['soup'].values)

In [26]:
# Using Tfidf matrix to get a similarity matrix
## Using linear_kernel() available in sklearn to process fast
similarity = linear_kernel(tfidf_mat, tfidf_mat)

In [29]:
# Mapping of Index - DealID
indices = pd.Series(PYR.index, index = PYR['dealid'])

# Function to get similar users
def getSimilar(deal_id):

    # Get 'Index' for the given DealID
    idx = indices[deal_id]
    
    # Calculate Similarity based on the 'Index', sort the score in revere order
    score = sorted(list(enumerate(similarity[idx])), key = lambda x: x[1], reverse = True)
    # Get User + TOP 5 Similar Users
    score = score[0:6]
    
    # Get 'Index' of Similar Users
    ids = [i[0] for i in score]
    
    # Return the Details of Similar Users
    return PYR.iloc[ids]

In [30]:
# Get Users which are Similar to User with DealID = 3255 
getSimilar(3255)

Unnamed: 0,asid,dealid,clientname,categoryname,eventname,location,date,gathering,budget,soup
0,272,3255,Amit,singer,charity,"nagpur,maharashtra,india",2018-06-16,2000.0,200000,"singer, charity, nagpur,maharashtra,india"
13,280,6480,Mahendra,liveband,privateparty,"thane,maharashtra,india",2018-06-22,130.0,30000,"liveband, privateparty, thane,maharashtra,india"
1,338,4995,Kavita,"anchor,emcee",corporate,"andherieast,mumbai,maharashtra,india",2018-11-16,1000.0,85000,"anchor,emcee, corporate, andherieast,mumbai,ma..."
52,341,6680,amol telang,liveband,concertfestival,"vashi,navimumbai,maharashtra,india",2018-09-16,200.0,60000,"liveband, concertfestival, vashi,navimumbai,ma..."
60,357,6732,Tushar,singer,corporate,indore,2018-07-29,250.0,200000,"singer, corporate, indore"
76,396,6807,Hartaj,singer,wedding,delhi,2018-10-28,300.0,200000,"singer, wedding, delhi"


In [31]:
# Get Users which are Similar to User with DealID = 5814 
getSimilar(5814)

Unnamed: 0,asid,dealid,clientname,categoryname,eventname,location,date,gathering,budget,soup
2,300,5814,Gautam Kapoor,liveband,wedding,jimcorbett,2019-08-02,160.0,30000,"liveband, wedding, jimcorbett"
17,296,6497,G.l verma,liveband,wedding,jammu,2018-10-18,3000.0,80000,"liveband, wedding, jammu"
22,299,6536,Dev khanna,liveband,wedding,jhunjhnu,2018-05-28,0.0,25000,"liveband, wedding, jhunjhnu"
77,385,6808,Saisandeep,liveband,wedding,vijayawada,2018-06-23,1200.0,100000,"liveband, wedding, vijayawada"
14,294,6490,Olive,liveband,wedding,"chennai,tamilnadu,india",2018-11-25,500.0,30000,"liveband, wedding, chennai,tamilnadu,india"
80,394,6819,yashima,liveband,privateparty,newdelhi,2018-06-17,350.0,200000,"liveband, privateparty, newdelhi"


Model is not 100% Accurate but given the amount of data it works well.

User with similar requirements are appearing to be closer to each other.

## END