We provide the following dataset (ASSIGNMENT.csv), you can pick either the recording or the composition data included and work on the
following format:

| title | writers |
| --- | --- |
| Yellow submarine | Leo Ouha |
| Anaconda | Mick George |
| Shape of you | Ed Sheeran |

* Extract the top 100 keywords in the title using TfidfVectorizer.

* Remove stopwords and calculate the same.

* Extract the top 100 2- grams and 3-grams (user term as gram not characters)

* Extract the list of unique writers and calculate their frequency in the dataset.

* Calculate the top 10 co-occurrence of writers .

* Recognize the duplicates in the dataset and export a csv with the fixed rows.

* Report and evaluate the results.

In [120]:
import pandas as pd
import numpy as np
from collections import Counter
from nltk import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [121]:
filename = "ASSIGNMENT.csv"
dataset = pd.read_csv("dataset/%s"%filename)
composition_cols = [i for i in dataset.columns if "comp" in i.lower()]
recording_cols = [i for i in dataset.columns if "recording" in i.lower()]

In [122]:
dataset.head(20)

Unnamed: 0,Composition Title,Composition Writers,Recording Title,Recording Writers
0,KOKAINA,YASSINE BAYBAH|DANIEL DLOUHY,Kokaina,A BAYBAH C DLOUHY
1,POR ESTAR CONTIGO,"MARTINEZ ESCAMILLA,FELIPE DE JESUS",Estar Contigo,MARTINEZ DE UBAGO RODRIGUEZ ALEJANDRO
2,Gallardo (feat. Rick Ross & Yo Gotti),William Alfred / Karmin Kharbouch / Mario Mims...,Connect the Dots (feat. Yo Gotti and Rick Ross),MARIO MIMS|NIKOLAS PAPAMITROU|RICK ROSS|ROBERT...
3,LESSON IN LEAVING,MAHER B/GOODRUM C,Lesson In Leavin',GOODRUM
4,QUÃ©DATE EN MIS BRAZOS QUEDATE EN MIS BRAZOS,KIKE SANTANDER,Quédate En Mis Brazos,SANTANDER KIKE
5,diamonds,sia furler mikkel eriksen tor hermansen,Diamonds,BENJAMIN LEVIN/MIKKEL STORLEER ERIKSEN/SIA KAT...
6,Can'T Take My Eyes Off Of You,Robert Gaudio|Bob Crewe,Can't Take My Eyes Off You,BOB CREWE
7,corazon,jimenez jose hernandez,"Corazon, Corazon",JIMENEZ SANDOVAL JOSE ALFREDO
8,CHEAP ASS WEAVE,"Belcalis ALMANZAR,Thomas Patrick BRODERICK",Cheap Ass Weave,ALMANZAR/BRODERICK/CA ALMANZAR
9,BOUND 2,KANYE WEST,Bound 2,Bobby Dukes/Bobby Massey/Charlie Wilson/Che J....


In [123]:
comp_dataset = dataset.loc[:, composition_cols].copy()
record_dataset = dataset.loc[:, recording_cols].copy()

In [124]:
dataset.isna().any()

Composition Title      False
Composition Writers     True
Recording Title        False
Recording Writers       True
dtype: bool

In [125]:
dataset.loc[dataset["Composition Writers"].isna(), :]

Unnamed: 0,Composition Title,Composition Writers,Recording Title,Recording Writers
494,THIS IS WAR,,Falconshield - This Is War 2: Piltover vs Zaun...,CA FLOBERG MARTIN DAVID/PA FALCONSHIELD/obo NCB
749,BYE BYE BABY,,BYE BYE BABY,BOB CREWE/BOB GAUDIO
1031,Think About,,뮤직비디오 1번트랙 Think About Chu,ASOTOUNION 1/ASOTOUNION 2/ASOTOUNION 3/ASOTOUN...
1115,WHERE DO I GO FROM HERE,,WHERE DO I GO FROM HERE,LARRY GROSSMAN;MARTY PANZER
1419,WALK AWAY,,"Dokken - ""Walk Away"" (Official Music Video)","DOKKEN, DON/LYNCH, GEORGE/PILSON, JEFF"
1453,smells like teen spirit,,SMELLS LIKE TEEN SPIRIT,Dave Grohl|Krist Novoselic|Kurt Cobain
1532,COME BACK TO ME,,TILL YOU COME BACK TO ME,"DAVIS VALERIE,GELLER HARVEY,WHITE KARYN"
1770,A LITTLE PRIEST,,A LITTLE PRIEST,Stephen Sondheim
1778,LONG TIME,,Some Things Last A Long Long Time,Daniel Johnston/Jad Fair
1878,JAZZ MUSIC,,Relax Winter Jazz Music - Soothing Winter Coff...,


In [126]:
dataset.loc[dataset["Recording Writers"].isna(), :]

Unnamed: 0,Composition Title,Composition Writers,Recording Title,Recording Writers
17,KING OF THE NIGHT,Thomas Sean MCMAHON,KING OF THE NIGHT,
31,GIVE ME ONE MORE CHANCE,ABRAHAM JR. QUINTANILLA,Give Me One More Chance,
35,SI QUIERES AMARLA,"DOURGE,PAUL|VILAS,GUILLERMO",Si Quieres Amarla,
39,REBECCA & JACK (THAT'LL BE THE DAY),"KHOSLA, SIDDHARTHA",REBECCA & JACK (THAT'LL BE THE DAY)-28221,
49,LETTRE A MA SOEUR,"BRAMS|KAMELANCIEN|MARIA,CHEBA",Lettre à ma soeur (feat. Cheba Maria),
...,...,...,...,...
2532,QUANTUM ENTANGLEMENT,Michael Meinl,Quantum Entanglement,
2537,IN THE MEADOW,PETER KOOBS,In the Meadow,
2548,PRELUDE IN C (ADAPTATION),CAROL TORNQUIST|JOHANN SEBASTIAN BACH,Prelude In C (Angel Song Album Version),
2568,FUMBLING OVER WORDS THAT RHYME,Edan PORTNOY,Fumbling Over Words That Rhyme 3,


### For the first run i will pick the composisition title & writers pair ###

First of all I'm going to remove those nan values that exist in the dataset since they dont offer any information

In [127]:
comp_dataset = comp_dataset.dropna()

Afterwards we will lowercase every token / sentence in order to have a uniformity in our sentences.

In [128]:
for i in range(comp_dataset.shape[-1]):
    comp_dataset.iloc[:, i] = comp_dataset.iloc[:,i].apply(lambda x: x.lower())

In [129]:
comp_dataset.columns = ["Title", "Writers"]

In [130]:
def get_n_values(comp_dataset, n=100, stopwords=False):
    if stopwords:
        stopwords = "english"
        tfidf_vect = TfidfVectorizer(analyzer='word', stop_words=stopwords)
    else:
        tfidf_vect = TfidfVectorizer(analyzer='word')
        
    tfidf_wm = tfidf_vect.fit_transform([" ".join(comp_dataset["Title"].tolist())])
    tfidf_tokens = tfidf_vect.get_feature_names()
    results = pd.DataFrame()
    results["score"] = tfidf_wm.data
    results.index = tfidf_tokens
    results = results.sort_values(by="score").tail(100)
    return results

In [131]:
def get_top_n_grams(comp_dataset,n_gram, field, n=100):
    corpora = " ".join(comp_dataset[field].tolist())
    ngram_counts = Counter(ngrams(corpora.split(), n_gram))
    return ngram_counts.most_common(n)

In [132]:
get_n_values(comp_dataset.copy())



Unnamed: 0,score
silent,0.020404
version,0.020404
vu,0.020404
samba,0.022259
soldiers,0.022259
...,...
wrld,0.194762
yadan,0.252264
whomp,0.315329
wow,0.411783


In [133]:
get_n_values(comp_dataset.copy(), stopwords=True)



Unnamed: 0,score
korsakov,0.027668
potro,0.027668
wallace,0.027668
upside,0.027668
vogue,0.027668
...,...
wolves,0.154151
turn,0.166009
wavin,0.185772
vichre,0.312255


Extract the top 100 2- grams and 3-grams (user term as gram not characters)

In [134]:
print(get_top_n_grams(comp_dataset, 2,"Title"))

[(('of', 'the'), 30), (('love', 'you'), 29), (('i', 'love'), 24), (('in', 'the'), 22), (('the', 'world'), 16), (('i', 'am'), 16), (('it', 'is'), 14), (('to', 'me'), 13), (('bad', 'asset'), 12), (('let', 'me'), 12), (('flower', 'of'), 11), (('of', 'scotland'), 11), (('no', 'te'), 11), (('with', 'me'), 10), (('to', 'the'), 10), (('i', 'want'), 10), (('want', 'to'), 10), (('love', 'me'), 10), (('on', 'you'), 10), (('my', 'life'), 9), (('do', 'not'), 9), (('for', 'you'), 9), (('baby', 'i'), 9), (('on', 'my'), 9), (('my', 'name'), 9), (('i', 'got'), 9), (('on', 'the'), 8), (('we', 'have'), 8), (('you', 'are'), 8), (('te', 'quiero'), 8), (('you', 'love'), 8), (('are', 'you'), 8), (('do', 'it'), 8), (('por', 'ti'), 8), (('with', 'you'), 8), (('que', 'no'), 8), (('amazing', 'grace'), 8), (('in', 'a'), 8), (('you', 'go'), 7), (('more', 'than'), 7), (('of', 'a'), 7), (('do', 'you'), 7), (('this', 'is'), 7), (('i', 'need'), 7), (('all', 'the'), 7), (('hold', 'on'), 7), (('my', 'eyes'), 6), (('i',

In [135]:
print(get_top_n_grams(comp_dataset, 3,"Title"))

[(('i', 'love', 'you'), 20), (('flower', 'of', 'scotland'), 11), (('un', 'millon', 'de'), 6), (('you', 'love', 'me'), 6), (('of', 'the', 'world'), 6), (('baby', 'i', 'love'), 6), (('more', 'than', 'feeling'), 5), (('friend', 'we', 'have'), 5), (('we', 'have', 'in'), 5), (('have', 'in', 'jesus'), 5), (('call', 'out', 'my'), 5), (('out', 'my', 'name'), 5), (('no', 'te', 'vayas'), 5), (('let', 'me', 'love'), 5), (('one', 'that', 'got'), 5), (('that', 'got', 'away'), 5), (('this', 'christmas', '(hang'), 5), (('christmas', '(hang', 'all'), 5), (('(hang', 'all', 'the'), 5), (('all', 'the', 'mistletoe)'), 5), (('twinkle', 'twinkle', 'little'), 4), (('twinkle', 'little', 'star'), 4), (('millon', 'de', 'lagrimas'), 4), (('let', 'you', 'go'), 4), (('what', 'a', 'friend'), 4), (('a', 'friend', 'we'), 4), (('lovely', 'day', '(part'), 4), (('day', '(part', 'ii)'), 4), (('i', 'want', 'to'), 4), (('fields', 'of', 'athenry'), 4), (('i', 'want', 'you'), 4), (('be', 'with', 'you'), 4), (('got', 'my', 'e

Extract the list of unique writers and calculate their frequency in the dataset.

In [136]:
seperators = ["|", "/", "\\","+","-", "\n", "\t"]

writers = comp_dataset["Writers"].tolist()

In [137]:
writers_cp = []
for i in writers:
    tmp = []
    for element in i:
        if element in seperators:
            tmp.append(",")
        else:
            tmp.append(element)
        
    writers_cp.append("".join(tmp))

In [138]:
ai = [i.split(",") for i in writers_cp]

In [139]:
ai = [j.strip() for i in ai for j in i]

In [141]:
Counter(ai)

Counter({'yassine baybah': 1,
         'daniel dlouhy': 1,
         'martinez escamilla': 1,
         'felipe de jesus': 1,
         'william alfred': 2,
         'karmin kharbouch': 2,
         'mario mims': 2,
         'richard morales': 2,
         'rick ross': 4,
         'maher b': 2,
         'goodrum c': 2,
         'kike santander': 3,
         'sia furler mikkel eriksen tor hermansen': 1,
         'robert gaudio': 1,
         'bob crewe': 1,
         'jimenez jose hernandez': 3,
         'belcalis almanzar': 4,
         'thomas patrick broderick': 1,
         'kanye west': 5,
         'marvel': 1,
         'a.': 1,
         'maye': 1,
         'marjorie': 1,
         'powers': 1,
         'amy': 1,
         'paul bateman franz xaver gruber': 2,
         'white': 3,
         'jack': 5,
         'joe tex': 1,
         'john williams': 2,
         'collins': 9,
         'william bootsy': 5,
         'clinton': 6,
         'george': 10,
         'sterling': 1,
         'donnie': 2