We provide the following dataset (ASSIGNMENT.csv), you can pick either the recording or the composition data included and work on the
following format:

| title | writers |
| --- | --- |
| Yellow submarine | Leo Ouha |
| Anaconda | Mick George |
| Shape of you | Ed Sheeran |

* Extract the top 100 keywords in the title using TfidfVectorizer.

* Remove stopwords and calculate the same.

* Extract the top 100 2- grams and 3-grams (user term as gram not characters)

* Extract the list of unique writers and calculate their frequency in the dataset.

* Calculate the top 10 co-occurrence of writers .

* Recognize the duplicates in the dataset and export a csv with the fixed rows.

* Report and evaluate the results.

In [37]:
import pandas as pd
import numpy as np
from collections import Counter
from nltk import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [38]:
filename = "ASSIGNMENT.csv"
dataset = pd.read_csv("dataset/%s"%filename)
composition_cols = [i for i in dataset.columns if "comp" in i.lower()]
recording_cols = [i for i in dataset.columns if "recording" in i.lower()]

In [39]:
dataset.head(5)

Unnamed: 0,Composition Title,Composition Writers,Recording Title,Recording Writers
0,KOKAINA,YASSINE BAYBAH|DANIEL DLOUHY,Kokaina,A BAYBAH C DLOUHY
1,POR ESTAR CONTIGO,"MARTINEZ ESCAMILLA,FELIPE DE JESUS",Estar Contigo,MARTINEZ DE UBAGO RODRIGUEZ ALEJANDRO
2,Gallardo (feat. Rick Ross & Yo Gotti),William Alfred / Karmin Kharbouch / Mario Mims...,Connect the Dots (feat. Yo Gotti and Rick Ross),MARIO MIMS|NIKOLAS PAPAMITROU|RICK ROSS|ROBERT...
3,LESSON IN LEAVING,MAHER B/GOODRUM C,Lesson In Leavin',GOODRUM
4,QUÃ©DATE EN MIS BRAZOS QUEDATE EN MIS BRAZOS,KIKE SANTANDER,Quédate En Mis Brazos,SANTANDER KIKE


In [40]:
comp_dataset = dataset.loc[:, composition_cols].copy()
record_dataset = dataset.loc[:, recording_cols].copy()

In [41]:
#check if there are any nan values in the dataset.

dataset.isna().any()

Composition Title      False
Composition Writers     True
Recording Title        False
Recording Writers       True
dtype: bool

In [42]:
dataset.loc[dataset["Composition Writers"].isna(), :].head(5)

Unnamed: 0,Composition Title,Composition Writers,Recording Title,Recording Writers
494,THIS IS WAR,,Falconshield - This Is War 2: Piltover vs Zaun...,CA FLOBERG MARTIN DAVID/PA FALCONSHIELD/obo NCB
749,BYE BYE BABY,,BYE BYE BABY,BOB CREWE/BOB GAUDIO
1031,Think About,,뮤직비디오 1번트랙 Think About Chu,ASOTOUNION 1/ASOTOUNION 2/ASOTOUNION 3/ASOTOUN...
1115,WHERE DO I GO FROM HERE,,WHERE DO I GO FROM HERE,LARRY GROSSMAN;MARTY PANZER
1419,WALK AWAY,,"Dokken - ""Walk Away"" (Official Music Video)","DOKKEN, DON/LYNCH, GEORGE/PILSON, JEFF"


In [43]:
dataset.loc[dataset["Recording Writers"].isna(), :].head(5)

Unnamed: 0,Composition Title,Composition Writers,Recording Title,Recording Writers
17,KING OF THE NIGHT,Thomas Sean MCMAHON,KING OF THE NIGHT,
31,GIVE ME ONE MORE CHANCE,ABRAHAM JR. QUINTANILLA,Give Me One More Chance,
35,SI QUIERES AMARLA,"DOURGE,PAUL|VILAS,GUILLERMO",Si Quieres Amarla,
39,REBECCA & JACK (THAT'LL BE THE DAY),"KHOSLA, SIDDHARTHA",REBECCA & JACK (THAT'LL BE THE DAY)-28221,
49,LETTRE A MA SOEUR,"BRAMS|KAMELANCIEN|MARIA,CHEBA",Lettre à ma soeur (feat. Cheba Maria),


### For the first run i will pick the composisition title & writers pair ###

First of all I'm going to remove those nan values that exist in the dataset since they dont offer any information

In [44]:
comp_dataset = comp_dataset.dropna()

Afterwards we will lowercase every token / sentence in order to have a uniformity in our sentences.

In [45]:
for i in range(comp_dataset.shape[-1]):
    comp_dataset.iloc[:, i] = comp_dataset.iloc[:,i].apply(lambda x: x.lower())

In [46]:
comp_dataset.columns = ["Title", "Writers"]

In [47]:
def get_n_values(comp_dataset, n=100, stopwords=False):
    if stopwords:
        stopwords = "english"
        tfidf_vect = TfidfVectorizer(analyzer='word', stop_words=stopwords)
    else:
        tfidf_vect = TfidfVectorizer(analyzer='word')
        
    tfidf_wm = tfidf_vect.fit_transform([" ".join(comp_dataset["Title"].tolist())])
    tfidf_tokens = tfidf_vect.get_feature_names()
    results = pd.DataFrame()
    results["score"] = tfidf_wm.data
    results.index = tfidf_tokens
    results = results.sort_values(by="score").tail(n)
    return results

In [48]:
def get_top_n_grams(comp_dataset,n_gram, field, n=100):
    corpora = " ".join(comp_dataset[field].tolist())
    ngram_counts = Counter(ngrams(corpora.split(), n_gram))
    return ngram_counts.most_common(n)

In [49]:
get_n_values(comp_dataset.copy())



Unnamed: 0,score
silent,0.020404
version,0.020404
vu,0.020404
samba,0.022259
soldiers,0.022259
...,...
wrld,0.194762
yadan,0.252264
whomp,0.315329
wow,0.411783


In [50]:
get_n_values(comp_dataset.copy(), stopwords=True)



Unnamed: 0,score
korsakov,0.027668
potro,0.027668
wallace,0.027668
upside,0.027668
vogue,0.027668
...,...
wolves,0.154151
turn,0.166009
wavin,0.185772
vichre,0.312255


Extract the top 100 2-grams and 3-grams (user term as gram not characters)

In [51]:
print(get_top_n_grams(comp_dataset, 2,"Title"))

[(('of', 'the'), 30), (('love', 'you'), 29), (('i', 'love'), 24), (('in', 'the'), 22), (('the', 'world'), 16), (('i', 'am'), 16), (('it', 'is'), 14), (('to', 'me'), 13), (('bad', 'asset'), 12), (('let', 'me'), 12), (('flower', 'of'), 11), (('of', 'scotland'), 11), (('no', 'te'), 11), (('with', 'me'), 10), (('to', 'the'), 10), (('i', 'want'), 10), (('want', 'to'), 10), (('love', 'me'), 10), (('on', 'you'), 10), (('my', 'life'), 9), (('do', 'not'), 9), (('for', 'you'), 9), (('baby', 'i'), 9), (('on', 'my'), 9), (('my', 'name'), 9), (('i', 'got'), 9), (('on', 'the'), 8), (('we', 'have'), 8), (('you', 'are'), 8), (('te', 'quiero'), 8), (('you', 'love'), 8), (('are', 'you'), 8), (('do', 'it'), 8), (('por', 'ti'), 8), (('with', 'you'), 8), (('que', 'no'), 8), (('amazing', 'grace'), 8), (('in', 'a'), 8), (('you', 'go'), 7), (('more', 'than'), 7), (('of', 'a'), 7), (('do', 'you'), 7), (('this', 'is'), 7), (('i', 'need'), 7), (('all', 'the'), 7), (('hold', 'on'), 7), (('my', 'eyes'), 6), (('i',

In [52]:
print(get_top_n_grams(comp_dataset, 3,"Title"))

[(('i', 'love', 'you'), 20), (('flower', 'of', 'scotland'), 11), (('un', 'millon', 'de'), 6), (('you', 'love', 'me'), 6), (('of', 'the', 'world'), 6), (('baby', 'i', 'love'), 6), (('more', 'than', 'feeling'), 5), (('friend', 'we', 'have'), 5), (('we', 'have', 'in'), 5), (('have', 'in', 'jesus'), 5), (('call', 'out', 'my'), 5), (('out', 'my', 'name'), 5), (('no', 'te', 'vayas'), 5), (('let', 'me', 'love'), 5), (('one', 'that', 'got'), 5), (('that', 'got', 'away'), 5), (('this', 'christmas', '(hang'), 5), (('christmas', '(hang', 'all'), 5), (('(hang', 'all', 'the'), 5), (('all', 'the', 'mistletoe)'), 5), (('twinkle', 'twinkle', 'little'), 4), (('twinkle', 'little', 'star'), 4), (('millon', 'de', 'lagrimas'), 4), (('let', 'you', 'go'), 4), (('what', 'a', 'friend'), 4), (('a', 'friend', 'we'), 4), (('lovely', 'day', '(part'), 4), (('day', '(part', 'ii)'), 4), (('i', 'want', 'to'), 4), (('fields', 'of', 'athenry'), 4), (('i', 'want', 'you'), 4), (('be', 'with', 'you'), 4), (('got', 'my', 'e

Extract the list of unique writers and calculate their frequency in the dataset.

The writers are **not seperated** in the same way all in all the rows. Given that, I will have to do a best effort
to extract the unique writers out of the dataset. It is not possible to use NER here because we don't have free text.


In [53]:
seperators = ["|", "/", "\\","+","-", "\n", "\t", ",",";"]
# these are the most common separators that were observed in the dataset.
writers = comp_dataset["Writers"].tolist()

In [54]:
writers_with_seperators = [i for i in writers if len(set(seperators).intersection(set(i)))]
writers_without_seperators = [i for i in writers if not len(set(seperators).intersection(set(i)))]

In [55]:
print("writers with seperators : ", len(writers_with_seperators))

writers with seperators :  1536


In [56]:
print("writers without seperators: ", len(writers_without_seperators))

writers without seperators:  1020


First of all we will try to get order out of the writers that contain seperators assuming that they have a uniformity in their seperation e.g. "artist A, artist B" and not "artist A | artist B , artist C".

After getting those arists seperated i will try to get the 2-grams out of the artists without a seperator.
The final (best effort) unique arstist will be the union of those two sets.


In [57]:
artists_g_a = []
for i in writers_with_seperators:
    common_sep = list(set(seperators).intersection(i))
    tmp = i.split(common_sep[0])
    tmp = [x.strip() for x in tmp]
    artists_g_a.extend(tmp)

In [58]:
artists_g_a[:10]

['yassine baybah',
 'daniel dlouhy',
 'martinez escamilla',
 'felipe de jesus',
 'william alfred',
 'karmin kharbouch',
 'mario mims',
 'richard morales',
 'rick ross',
 'maher b']

Assume that each artist has 2 names e.g. name and surname in this set.

In [59]:
unique_artists = []
wwsep = []
for i in writers_without_seperators:
    if len(i.split(" ")) == 2:
        unique_artists.append(i)
    else:
        wwsep.append(i) 

In [60]:
len(wwsep)

615

In [61]:
corpora = " ".join(wwsep)
ngram_counts = Counter(ngrams(corpora.split(), 2))

Assume that the most common 100 contain 100 unique artists out of the list of 615 that remained.

In [62]:
last_list = []
for x,y in ngram_counts.most_common(100):
    last_list.append(" ".join(x))

In [63]:
all_artists = last_list + unique_artists + artists_g_a
all_artists = [i.strip() for i in all_artists if len(i.strip())>2]

In [64]:
print("unique artists are", len(list(set(last_list + unique_artists + artists_g_a))))

unique artists are 4377


In [65]:
artist_occurences = Counter(last_list + unique_artists + artists_g_a)

In [66]:
artist_occurences.most_common()[:10]

[('traditional', 20),
 ('trad', 12),
 ('sean combs', 11),
 ('nusrat fateh ali khan', 11),
 ('dp', 11),
 ('john', 10),
 ('pharrell williams', 7),
 ('louis bell', 7),
 ('williams', 7),
 ('aubrey graham', 6)]

In [67]:
fixed_keys = list(artist_occurences.keys())
for i in fixed_keys:
    if len(i)<3:
        artist_occurences.pop(i)

In [68]:
keep = []

for line in comp_dataset["Writers"].tolist():
    line_l = set()
    for artist in artist_occurences:
        if artist in line:
            line_l.add(artist)
    if len(line_l):
        keep.append(line_l)
    

In [69]:
dict_intersections = dict()

for idi, i in enumerate(keep):
    for idj, j in enumerate(keep):
        if idj != idi:
            intersect = i.intersection(j)
            if len(intersect)<2:
                continue
            intersect = str(sorted(list(intersect)))
            # intersect = str(list(intersect).sort())
            if intersect in dict_intersections:
                dict_intersections[intersect] +=1
            else:
                dict_intersections[intersect] = 0

Calculate the top 10 co-occurrence of writers

In [70]:
sorted_artists = sorted([(i[-1], i[0]) for i in dict_intersections.items()], reverse=True)

In [71]:
sorted_artists[:10]

[(3573, "['chris', 'christopher']"),
 (1973, "['trad', 'traditional']"),
 (1627, "['jose', 'joseph']"),
 (1363, "['dan', 'daniel']"),
 (1239, "['carlo', 'carlos']"),
 (1015, "['jack', 'jackson']"),
 (797, "['john', 'johnson']"),
 (249, "['bob', 'bobby']"),
 (241, "['jose', 'jose luis']"),
 (173, "['live', 'oliver']")]

Remove duplicate from the whole dataset

In [72]:
dataset = dataset.drop_duplicates()
dataset.to_csv("excersize_1.csv")