# Machine Learning Homework 5
\> БИБ201 Рудзянский Артемий 

## Clustering: There Is No Accounting for Taste
**Task**
- Build a custom dataset  
        features: text  
        label: artist
    - Strip punctuation from the text
    - Use an Encoder
- Cluster the resulting dataset
- Analyze the results

## Building the dataset
The dataset structure is the one given above.

I decided to use a public API to obtain the song lyrics.

Process:
1) I chose several artists from different genres, so that their lyrics would differ more
2) Got a list of songs for each chosen artist
3) Fetched the lyrics through the [ChartLyrics Lyric API](http://www.chartlyrics.com/api.aspx)

0) Eminem  
1) Philip Wesley  
2) NEFFEX  
3) Ed Sheeran  
4) The Beatles  
5) System of A Down  
6) Gorillaz  
7) Toto  
8) TOOL  
9) Nightwish 
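
The numbering above can be read as an integer encoding of the `artist` label. A minimal sketch of such an encoder in plain Python follows. The helper name `encode_labels` is hypothetical and not part of the original notebook, and the sample list is only illustrative. A library encoder such as scikit-learn's `LabelEncoder` could be used instead, though it assigns IDs in sorted order rather than in order of first appearance.

```python
def encode_labels(labels):
    """Map each distinct label to an integer ID in order of first appearance."""
    mapping = {}
    encoded = []
    for label in labels:
        if label not in mapping:
            # The next free ID equals the number of labels seen so far
            mapping[label] = len(mapping)
        encoded.append(mapping[label])
    return encoded, mapping


# Illustrative input only: one entry per song, repeated artists share an ID
artists = ["Eminem", "Eminem", "Toto", "Nightwish", "Toto"]
ids, mapping = encode_labels(artists)
print(ids)      # [0, 0, 1, 2, 1]
print(mapping)  # {'Eminem': 0, 'Toto': 1, 'Nightwish': 2}
```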

In [26]:
import numpy as np
import pandas as pd
import requests
# Install xmltodict on the fly if it is missing (`!pip` works only inside a notebook)
try:
    import xmltodict
except ImportError:
    !pip install xmltodict
    import xmltodict

### Processing the original CSV file, using `Eminem.csv` as an example
The remaining CSV files are processed the same way.

First, we write a function that fetches a song's lyrics through the public API.
We also write a function that cleans the fetched text, leaving only the uppercase letters of the text separated by single spaces.

In [27]:
# Fetch the lyrics of a song
def getLyrics(artist, song):
    # Request parameters
    url = "http://api.chartlyrics.com/apiv1.asmx/SearchLyricDirect"

    querystring = {"artist": artist, "song": song}

    headers = {
        "X-RapidAPI-Key": "SIGN-UP-FOR-KEY",
        "X-RapidAPI-Host": "sridurgayadav-chart-lyrics-v1.p.rapidapi.com"
    }
    # Send the request; any network or HTTP failure is reported as ValueError
    try:
        response = requests.request("GET", url, headers=headers, params=querystring)
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        raise ValueError("Bad Response") from err

    # Parse the XML response
    dict_data = xmltodict.parse(response.content)['GetLyricResult']
    if "LyricSong" in dict_data:
        # Accept the result only if artist and title match (case-insensitive) and lyrics exist
        if dict_data["LyricArtist"].upper() == artist.upper() and dict_data["LyricSong"].upper() == song.upper() and dict_data["Lyric"] is not None:
            lyrics = dict_data["Lyric"]
            return clean_lyrics(lyrics)
        else:
            raise ValueError(f"Not found exact song {song}, {artist}")
    else:
        raise ValueError("Not found anything about the song")


# Strip everything except the letters A-Z and spaces from the text
def check_letter(c):
    return "A" <= c <= "Z" or c == " "

def clean_lyrics(lyrics):
    lyrics = lyrics.upper()
    # Note: newlines are removed here as well, so words on adjacent lines get joined
    lyrics = ''.join(filter(check_letter, lyrics))

    return " ".join(lyrics.split())
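
Because `clean_lyrics` deletes newline characters before splitting on whitespace, the last word of one line fuses with the first word of the next. A possible variant is sketched below, assuming that keeping word boundaries is desirable. The name `clean_lyrics_keep_breaks` is hypothetical and not part of the original notebook. It replaces every run of non-letter characters with a single space, after first deleting apostrophes so that contractions stay joined as in the original helper. Note that hyphenated words are split by this variant, whereas the original joins them.

```python
import re


def clean_lyrics_keep_breaks(lyrics):
    """Uppercase the text and keep only A-Z words separated by single spaces."""
    lyrics = lyrics.upper()
    # Delete apostrophes first so "I'M" becomes "IM" rather than "I M"
    lyrics = lyrics.replace("'", "").replace("\u2019", "")
    # Replace every run of other non-letter characters (newlines included) with one space
    lyrics = re.sub(r"[^A-Z]+", " ", lyrics)
    return lyrics.strip()


sample = "May I have your attention, please?\nMay I have your attention, please?"
print(clean_lyrics_keep_breaks(sample))
# MAY I HAVE YOUR ATTENTION PLEASE MAY I HAVE YOUR ATTENTION PLEASE
```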

In [28]:
eminem_df = pd.read_csv("Eminem.csv", names=['Track', 'Artist'], header=0, usecols=['Track', 'Artist'])
eminem_df = eminem_df[eminem_df["Artist"] == "Eminem"]  # keep only the given artist
display(eminem_df.head(), eminem_df.shape)

for track in eminem_df["Track"].values:
    # The character check would exclude titles with symbols such as "(feat. ) -",
    # but the leading `True or` disables it, so every track is tried
    if True or all( (64 < ord(c) and ord(c) < 123) or ord(c) == 32  for c in track):
        try:
            lyrics = getLyrics("Eminem", track)
            # replace the track name with its lyrics
            eminem_df = eminem_df.replace(to_replace=track, value=lyrics)
        except ValueError:  # drop the track if no lyrics were found
            eminem_df = eminem_df[eminem_df['Track'] != track]
    else:  # drop the track if its title contains excluded symbols
        eminem_df = eminem_df[eminem_df['Track'] != track]

display(eminem_df.head(), eminem_df.shape)

Unnamed: 0,Track,Artist
0,"Lose Yourself - From ""8 Mile"" Soundtrack",Eminem
1,The Real Slim Shady,Eminem
2,Stan,Eminem
3,Till I Collapse,Eminem
4,My Name Is,Eminem


(52, 2)

Unnamed: 0,Track,Artist
1,MAY I HAVE YOUR ATTENTION PLEASEMAY I HAVE YOU...,Eminem
4,HI MY NAME IS WHATMY NAME IS WHOMY NAME IS CHI...,Eminem
7,INTROWHERES MY SNAREI HAVE NO SNARE ON MY HEAD...,Eminem
8,NOT AFRAIDCHORUSIM NOT AFRAID IM NOT AFRAIDTO ...,Eminem
9,MAN WHATEVERDRE JUST LET IT RUNAYO TURN THE BE...,Eminem


(16, 2)

The example above illustrates how an initial CSV file is turned into a finished piece of the dataset.
Next, we write a function so that all artists can be processed in a loop.

In [29]:
def from_csv_to_df(artist):
    # Load the artist's track list and keep only rows credited to that artist
    df = pd.read_csv(f"{artist}.csv", names=['Track', 'Artist'], header=0, usecols=['Track', 'Artist'])
    df = df[df['Artist'] == artist]
    # artist = artist.upper()
    # df = df.applymap(lambda s: s.upper() if type(s) == str else s)
    for track in df['Track'].values:
        # `True or` disables the title character check, so every track is tried
        if True or all( (64 < ord(c) and ord(c) < 123) or ord(c) == 32 for c in track):
            try:
                lyrics = getLyrics(artist, track)
                # replace the track name with its lyrics
                df = df.replace(to_replace=track, value=lyrics)
            except ValueError as err:
                # report the failure and drop the track
                print(err)
                df = df[df['Track'] != track]
        else:
            df = df[df['Track'] != track]
    return df
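
The loop above rebuilds the frame on every iteration, and `DataFrame.replace` substitutes the track name wherever that exact value occurs, in any column. One way to separate fetching from frame manipulation is sketched below. The helper `collect_lyrics` and the stand-in `fake_fetch` are hypothetical names introduced only for illustration. The sketch assumes the fetch function raises `ValueError` on failure, as `getLyrics` does.

```python
def collect_lyrics(artist, tracks, fetch):
    """Return (found, skipped): lyrics keyed by track, and the tracks that failed.

    `fetch` is any callable taking (artist, track) that returns the lyrics
    or raises ValueError.
    """
    found = {}
    skipped = []
    for track in tracks:
        try:
            found[track] = fetch(artist, track)
        except ValueError as err:
            # Remember why the track was dropped instead of only printing it
            skipped.append((track, str(err)))
    return found, skipped


# Stand-in for the real API call, used only to keep this sketch offline
def fake_fetch(artist, track):
    known = {"Stan": "PLACEHOLDER LYRICS"}
    if track not in known:
        raise ValueError(f"Not found exact song {track}, {artist}")
    return known[track]


found, skipped = collect_lyrics("Eminem", ["Stan", "Till I Collapse"], fake_fetch)
print(found)    # {'Stan': 'PLACEHOLDER LYRICS'}
print(skipped)  # [('Till I Collapse', 'Not found exact song Till I Collapse, Eminem')]
```

In the notebook, `getLyrics` would be passed as `fetch`, and the `found` values could then be placed into a frame in a single step.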

Let's put the artists we are interested in into a list.

In [30]:
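arr_artists = [
    # 'Eminem',  # already processed above; eminem_df seeds the combined frame
    # 'Sabaton',
    'Elton John',
    'Michael Jackson',
    'The Beatles',
    'System of A Down',
    'Gorillaz',
    'TOTO',
    'TOOL',
    'Nightwish'
]

# Start from the Eminem frame and append the tracks of each remaining artist
df = eminem_df
for artist in arr_artists:
    df_new = from_csv_to_df(artist)
    df = pd.concat([df, df_new])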


Not found exact song Hold Me Closer, Elton John
Not found exact song Cold Heart - PNAU Remix, Elton John
Not found exact song Rocket Man (I Think It's Going To Be A Long, Long Time), Elton John
Bad Response
Not found exact song Bennie And The Jets - , Elton John
Not found exact song Goodbye Yellow Brick Road - , Elton John
Not found exact song Finish Line, Elton John
Not found exact song I Guess That's Why They Call It The Blues, Elton John
Not found exact song Can You Feel The Love Tonight - Remastered, Elton John
Not found exact song Circle Of Life - Remastered, Elton John
Not found exact song Don't Go Breaking My Heart, Elton John
Not found exact song Song For Guy, Elton John
Not found exact song (I'm Gonna) Love Me Again, Elton John
Not found exact song Candle In The Wind - , Elton John
Not found exact song Always Love You, Elton John
Not found exact song Saturday Night’s Alright (For Fighting) - , Elton John
Not found exact song Bennie And The Jets - Recorded At The Colosseum, Cae
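
Many of the misses in the log above involve titles that carry a suffix such as ` - Remastered` or a trailing parenthetical. A sketch of a title normalizer that could be applied before querying is shown below. The helper `normalize_title` is hypothetical, and whether the API then finds these songs has not been verified. Stripping a suffix such as a remix credit also changes which recording is being asked for.

```python
import re


def normalize_title(title):
    """Strip a ' - ...' suffix and a trailing parenthetical from a track title."""
    # Drop everything after the first " - " separator, e.g. " - Remastered"
    title = title.split(" - ", 1)[0]
    # Drop a parenthetical only when it ends the title, e.g. "(For Fighting)"
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)
    return title.strip()


print(normalize_title("Can You Feel The Love Tonight - Remastered"))
# Can You Feel The Love Tonight
print(normalize_title("Bennie And The Jets - "))
# Bennie And The Jets
print(normalize_title("(I'm Gonna) Love Me Again"))
# (I'm Gonna) Love Me Again
```

A leading parenthetical is left untouched because it is usually part of the title itself.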

In [31]:
df = df.loc[:, ["Artist", "Track"]]  # put the label column first
display(df.shape, df)
df.to_csv("out.csv", index=False)

(186, 2)

Unnamed: 0,Artist,Track
1,Eminem,MAY I HAVE YOUR ATTENTION PLEASEMAY I HAVE YOU...
4,Eminem,HI MY NAME IS WHATMY NAME IS WHOMY NAME IS CHI...
7,Eminem,INTROWHERES MY SNAREI HAVE NO SNARE ON MY HEAD...
8,Eminem,NOT AFRAIDCHORUSIM NOT AFRAID IM NOT AFRAIDTO ...
9,Eminem,MAN WHATEVERDRE JUST LET IT RUNAYO TURN THE BE...
...,...,...
41,Nightwish,TOLL NO BELL FOR ME FATHERBUT LET THIS CUP OF ...
42,Nightwish,A LADY WITH A VIOLINPLAYING TO THE SEALSHEARKE...
45,Nightwish,A GRAND OASIS IN THE VASTNESS OF GLOOMCHILD OF...
46,Nightwish,I WANT TO SEE WHERE THE SIRENS SINGHEAR HOW TH...



I believe the resulting dataset has no cluster structure, because most common words occur in every text regardless of genre or artist.
Analyzing word combinations rather than single words would be worthwhile here and would considerably improve effectiveness. I also think a more detailed approach is needed, one that accounts for more aspects in the model. In our case, the model is too simple.
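
To illustrate the point about word combinations, the sketch below compares two texts by the sets of their unigrams and of their bigrams. The helpers `ngram_counts` and `jaccard` are hypothetical names, and the two sample strings are made up for illustration rather than taken from the dataset. The texts share their entire vocabulary, yet only half of their bigrams. A library vectorizer, for example scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)`, could serve the same purpose on the real data.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count word n-grams in a cleaned, space-separated text."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def jaccard(a, b):
    """Jaccard similarity of the key sets of two counters."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


# Two made-up lines built from the same five words
text_a = "I LOVE YOU AND YOU LOVE ME"
text_b = "YOU AND I LOVE ME AND YOU"

unigram_sim = jaccard(ngram_counts(text_a, 1), ngram_counts(text_b, 1))
bigram_sim = jaccard(ngram_counts(text_a, 2), ngram_counts(text_b, 2))
print(unigram_sim, bigram_sim)  # 1.0 0.5
```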

In [33]:
artist = 'TOTO'
df = pd.read_csv(f"{artist}.csv", names=['Track', 'Artist'], header=0, usecols=['Track', 'Artist'])
display(df)
df = df[df['Artist'] == artist]
display(df)

Unnamed: 0,Track,Artist
0,Africa,TOTO
1,Rosanna,TOTO
2,Hold the Line,TOTO
3,I Won't Hold You back,TOTO
4,Make Believe,TOTO
...,...,...
80,These Chains,TOTO
81,Lion,TOTO
82,We Made It,TOTO
83,Gypsy Train,TOTO


Unnamed: 0,Track,Artist
0,Africa,TOTO
1,Rosanna,TOTO
2,Hold the Line,TOTO
3,I Won't Hold You back,TOTO
4,Make Believe,TOTO
...,...,...
80,These Chains,TOTO
81,Lion,TOTO
82,We Made It,TOTO
83,Gypsy Train,TOTO
