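# Class: Data Analysis and Presentation

### Comparing artists based on their songs

We assume that we can measure how similar two artists are by how commonly each of them uses each word. In the code below every word is counted once per song, so the stored value is the number of that artist's songs in which the word appears.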

In [1]:
import csv

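# data[author][word] = number of that author's songs that contain the word
data = {}

with open("songdata.csv") as file:
    # Each row has four fields; only the first (artist) and the last (lyrics) are used.
    # The header row is not skipped, so it ends up as an entry too.
    for author,_,_,lyric in csv.reader(file):

        data[author] = data.get(author,{})
        # set() keeps each word once per song
        for word in set(lyric.lower().split()):
            data[author][word] = data[author].get(word,0) + 1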

In [2]:
data["ABBA"]

{'a': 101,
 'fine': 8,
 "we'll": 8,
 'i': 103,
 'can': 62,
 'believe': 16,
 'be': 61,
 'holds': 1,
 'on': 58,
 'of': 83,
 'face': 16,
 'feel': 37,
 'the': 104,
 'who': 16,
 'one': 38,
 'walk': 6,
 'talking': 3,
 'way': 30,
 'ever': 16,
 'we': 50,
 'squeezes': 1,
 'about': 18,
 'do?': 2,
 'park': 4,
 'sees': 2,
 'leaves': 5,
 'what': 48,
 'do,': 5,
 'how': 37,
 'she': 20,
 'makes': 11,
 'plan': 2,
 'just': 47,
 'without': 10,
 'wonderful': 3,
 'kind': 12,
 'walking': 5,
 'look': 24,
 'if': 45,
 'for': 71,
 'to': 102,
 'face,': 1,
 'fellow': 2,
 'girl,': 4,
 'go': 43,
 'smiles': 1,
 'hours': 3,
 'all': 63,
 'things': 13,
 'special': 7,
 'could': 31,
 'and': 110,
 'something': 16,
 'mine?': 1,
 'it': 86,
 'when': 53,
 'my': 72,
 'that': 75,
 "i'm": 53,
 'blue': 13,
 'me': 83,
 'means': 4,
 'in': 85,
 'hand': 10,
 'her': 15,
 'at': 37,
 'lucky': 1,
 "it's": 53,
 'be?': 2,
 "she's": 7,
 "you're": 43,
 'me,': 23,
 'breeze': 4,
 'thousand': 2,
 'shimmer': 1,
 'gently': 2,
 'now': 48,
 'time':

Something we did not do during class, but that does need to be done, is removing the words that give us no information (stop words such as "a" or "the"). Another thing we should do is strip all the symbols, so that 'do', 'do?' and 'do,' count as the same word.
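
As a rough illustration of the first point, stop words could be filtered with a small set lookup. This is only a sketch: `STOP_WORDS` and `clean_words` are hypothetical names that are not part of the class code, and the list here is far too short for real use.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "the", "and", "i", "to", "of", "in", "it", "me", "my"}

def clean_words(lyric, stop_words=STOP_WORDS):
    """Return the set of distinct, informative words in one lyric."""
    words = set()
    for raw in lyric.lower().split():
        word = re.sub(r"[^\w]", "", raw)  # drop every non-word character
        # Skip tokens that were only symbols, and skip stop words
        if word and word not in stop_words:
            words.add(word)
    return words

print(sorted(clean_words("I believe in angels, and the good in everything I see!")))
# ['angels', 'believe', 'everything', 'good', 'see']
```

Stripping the symbols before building the set also means each cleaned word is counted at most once per song.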

If we don't know how to do something, we can always turn to Stack Overflow.

In [3]:
import re
# https://stackoverflow.com/questions/875968/how-to-remove-symbols-from-a-string-with-python
def strip_simbols(string):
    # Remove every character that is not a letter, digit or underscore
    return re.sub(r'[^\w]', '', string)

In [4]:
import csv

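data = {}

with open("songdata.csv") as file:
    for author,_,_,lyric in csv.reader(file):

        data[author] = data.get(author,{})
        # Note: set() runs before the symbols are stripped, so variants such as
        # 'do?' and 'do,' in the same song each add one to 'do'.
        for word in set(lyric.lower().split()):
            word = strip_simbols(word)
            data[author][word] = data[author].get(word,0) + 1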

In [5]:
data["ABBA"]

{'a': 101,
 'fine': 9,
 'well': 27,
 'i': 114,
 'can': 64,
 'believe': 16,
 'be': 66,
 'holds': 1,
 'on': 68,
 'of': 83,
 'face': 18,
 'feel': 39,
 'the': 107,
 'who': 16,
 'one': 38,
 'walk': 7,
 'talking': 4,
 'way': 32,
 'ever': 17,
 'we': 51,
 'squeezes': 1,
 'about': 18,
 'do': 50,
 'park': 4,
 'sees': 2,
 'leaves': 5,
 'what': 49,
 'how': 37,
 'she': 20,
 'makes': 11,
 'plan': 2,
 'just': 47,
 'without': 10,
 'wonderful': 3,
 'kind': 13,
 'walking': 6,
 'look': 27,
 'if': 47,
 'for': 71,
 'to': 104,
 'fellow': 2,
 'girl': 24,
 'go': 47,
 'smiles': 2,
 'hours': 3,
 'all': 64,
 'things': 14,
 'special': 8,
 'could': 31,
 'and': 113,
 'something': 17,
 'mine': 4,
 'it': 107,
 'when': 53,
 'my': 74,
 'that': 79,
 'im': 58,
 'blue': 13,
 'me': 112,
 'means': 4,
 'in': 86,
 'hand': 12,
 'her': 15,
 'at': 37,
 'lucky': 2,
 'its': 56,
 'shes': 8,
 'youre': 43,
 'breeze': 5,
 'thousand': 2,
 'shimmer': 1,
 'gently': 2,
 'now': 54,
 'time': 31,
 'take': 32,
 'velvet': 1,
 'float': 1,
 'nig

In [6]:
for author in data:
    print(author)

artist
ABBA
Ace Of Base
Adam Sandler
Adele
Aerosmith
Air Supply
Aiza Seguerra
Alabama
Alan Parsons Project
Aled Jones
Alice Cooper
Alice In Chains
Alison Krauss
Allman Brothers Band
Alphaville
America
Amy Grant
Andrea Bocelli
Andy Williams
Annie
Ariana Grande
Ariel Rivera
Arlo Guthrie
Arrogant Worms
Avril Lavigne
Backstreet Boys
Barbie
Barbra Streisand
Beach Boys
The Beatles
Beautiful South
Beauty And The Beast
Bee Gees
Bette Midler
Bill Withers
Billie Holiday
Billy Joel
Bing Crosby
Black Sabbath
Blur
Bob Dylan
Bob Marley
Bob Rivers
Bob Seger
Bon Jovi
Boney M.
Bonnie Raitt
Bosson
Bread
Britney Spears
Bruce Springsteen
Bruno Mars
Bryan White
Cake
Carly Simon
Carol Banawa
Carpenters
Cat Stevens
Celine Dion
Chaka Khan
Cheap Trick
Cher
Chicago
Children
Chris Brown
Chris Rea
Christina Aguilera
Christina Perri
Christmas Songs
Christy Moore
Chuck Berry
Cinderella
Clash
Cliff Richard
Coldplay
Cole Porter
Conway Twitty
Counting Crows
Creedence Clearwater Revival
Crowded House
Culture Club
Cyndi

I import the dictionary utilities from the first part of the assignment (TP); from them we will use the function `calcular_distancia` as our metric. Metrics are always defined according to what we want to do. There are more or less accepted metrics for certain tasks, but that does not stop us from defining our own (with some justification, of course).

In [7]:
import utilidades_diccionarios
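
The module `utilidades_diccionarios` is not shown in these notes, so the following is only a guess at the kind of metric `calcular_distancia` could be. It assumes the third argument is an exponent `p` and that words missing from one dictionary count as 0. The name `distancia_sketch` is hypothetical, and the real function may normalise or take a root.

```python
def distancia_sketch(a, b, p=2):
    """Sum of |a[w] - b[w]|**p over every word present in either dictionary.

    Hypothetical stand-in, NOT the real utilidades_diccionarios.calcular_distancia.
    """
    words = set(a) | set(b)
    # A word missing from one dictionary is treated as having count 0
    return sum(abs(a.get(w, 0) - b.get(w, 0)) ** p for w in words)

artist_x = {"love": 3, "money": 2}
artist_y = {"love": 1, "blues": 4}
print(distancia_sketch(artist_x, artist_y, 2))  # (3-1)^2 + (2-0)^2 + (0-4)^2 = 24
```

With raw counts like these, an artist with many songs ends up far from everyone else, which is one reason a metric has to be chosen with the goal in mind.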

I check how similar some randomly chosen artists are. I will need Markdown to make the result look nicer, so I import the function that renders it directly in the cell output:

In [8]:
from IPython.display import display, Markdown, Latex

In [9]:
import random

artists = list(data.keys())

In [10]:
artist_a = random.choice(artists)
artist_b = random.choice(artists)
diference = utilidades_diccionarios.calcular_distancia(data[artist_a],data[artist_b],2)

display(Markdown( f'Difference between *{artist_a}* and _{artist_b}_ is: **{diference}** '))


Difference between *Roxy Music* and _Ultramagnetic Mc's_ is: **50153** 

In [11]:
artist_a = random.choice(artists)
artist_b = random.choice(artists)
diference = utilidades_diccionarios.calcular_distancia(data[artist_a],data[artist_b],2)

display(Markdown( f'Difference between *{artist_a}* and _{artist_b}_ is: **{diference}** '))


Difference between *Planetshakers* and _David Guetta_ is: **80956** 

In [12]:
artist_a = random.choice(artists)
artist_b = random.choice(artists)
diference = utilidades_diccionarios.calcular_distancia(data[artist_a],data[artist_b],2)

display(Markdown( f'Difference between *{artist_a}* and _{artist_b}_ is: **{diference}** '))


Difference between *Quietdrive* and _Ray Boltz_ is: **180532** 

In [13]:
artist_a = random.choice(artists)
artist_b = random.choice(artists)
diference = utilidades_diccionarios.calcular_distancia(data[artist_a],data[artist_b],2)

display(Markdown( f'Difference between *{artist_a}* and _{artist_b}_ is: **{diference}** '))


Difference between *Falco* and _Bonnie Raitt_ is: **472610** 

In [14]:
artist_a = random.choice(artists)
artist_b = random.choice(artists)
diference = utilidades_diccionarios.calcular_distancia(data[artist_a],data[artist_b],2)

display(Markdown( f'Difference between *{artist_a}* and _{artist_b}_ is: **{diference}** '))


Difference between *Crowded House* and _Neil Young_ is: **206362** 

# Extra

There is a simple way to put together new songs by an artist using things we have already seen. Although it is a crude solution, the more sophisticated ones follow the same idea.

In [15]:
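import csv

# data[word] = list of every token (word or symbol) seen right after `word`
data = {}

authors =  ["ABBA"]

with open("songdata.csv") as file:
    for author,_,_,lyric in csv.reader(file):

        if author not in authors:
            continue

        prev = ""  # "" marks both the start and the end of a song
        for word in lyric.lower().split():
            # Strip trailing symbols from the word
            new_word = word
            while new_word and not new_word[-1].isalpha():
                new_word = new_word[:-1]

            data[prev] = data.get(prev,[])
            data[prev].append(new_word)

            prev = new_word

            # Keep the last trailing symbol as a token of its own
            if not word[-1].isalpha():
                data[prev] = data.get(prev,[])
                data[prev].append(word[-1])

                prev = word[-1]

        # Record that the song ends after its last token
        data[prev] = data.get(prev,[])
        data[prev].append("")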

What I did is build a dictionary with the words of the songs, associating each one with a list of all the words that appear right after it. Because the list keeps repetitions, picking an element at random makes the words that follow more often more likely to be chosen. So, just like with predictive text, I can put together an ABBA song like this:

In [16]:
cancion = []
prev = ""
for _ in range(100):
    next_word = random.choice(data[prev])
    cancion.append(next_word)
    prev = next_word
    
display(Markdown("_" + " ".join(cancion) + "_") )

_come on the people need me you're gone  did i remember soldiers write the songs i'm pretty young and they say oh bang , this world where my life but you hear the south from afar twinkle , yeah rock and you're gone though we had never left for living without saying treat him i can't get married , you tell me of guns and there is just like i am to hard i fear when i think you're all right , he called the answer if you see is a dreamworld here like an eternal lie so good_

Even for something this quick, the result makes some kind of sense. If we want it to make more sense, we should look at more than the previous word, for example the previous 2, and that should improve the output.
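
The idea of looking at more than one previous word can be sketched generically on a toy text. This is a simplified illustration, not the code used in these notes: `build_chain` and `generate` are hypothetical helpers, the text is made up, and punctuation is ignored.

```python
import random

def build_chain(words, order=2):
    """Map each tuple of `order` previous words to the list of words seen next."""
    chain = {}
    prev = ("",) * order           # empty strings mark the start of the text
    for word in words + [""]:      # a trailing "" marks the end of the text
        chain.setdefault(prev, []).append(word)
        prev = prev[1:] + (word,)  # slide the window one word forward
    return chain

def generate(chain, order=2, max_words=20, rng=random):
    """Walk the chain from the start marker until the end marker or max_words."""
    prev = ("",) * order
    out = []
    for _ in range(max_words):
        word = rng.choice(chain[prev])  # repeated entries are proportionally likelier
        if word == "":
            break
        out.append(word)
        prev = prev[1:] + (word,)
    return out

toy = "i am blue and i am lucky and i am blue".split()
chain = build_chain(toy, order=2)
print(chain[("i", "am")])  # ['blue', 'lucky', 'blue']
print(" ".join(generate(chain, order=2)))
```

A larger `order` gives more coherent text but copies longer stretches of the original lyrics, because fewer contexts have more than one possible continuation.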

In [17]:
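import csv

# data[(w1, w2)] = list of every token seen right after the pair (w1, w2)
data = {}

authors =  ["ABBA"]

with open("songdata.csv") as file:

    # Note: prev is initialised once and carried over from song to song, so the
    # key ("","") is filled only at the very start, with the first token of the
    # first matching song. Every generated text therefore begins the same way.
    prev = ("","")
    for author,_,_,lyric in csv.reader(file):

        if author not in authors:
            continue

        for word in lyric.lower().split():
            new_word = word

            # Keep a leading symbol as a token of its own
            if not word[0].isalpha():
                data[prev] = data.get(prev,[])
                data[prev].append(word[0])

                prev = (prev[-1],word[0])

            # Strip leading and trailing symbols from the word
            while new_word and not new_word[0].isalpha():
                new_word = new_word[1:]

            while new_word and not new_word[-1].isalpha():
                new_word = new_word[:-1]

            data[prev] = data.get(prev,[])
            data[prev].append(new_word)

            prev = (prev[-1],new_word)

            # Keep the last trailing symbol as a token of its own
            if not word[-1].isalpha():
                data[prev] = data.get(prev,[])
                data[prev].append(word[-1])

                prev = (prev[-1],word[-1])

        # Record the end of the song and slide the window past it
        data[prev] = data.get(prev,[])
        data[prev].append("")
        prev = (prev[-1],"")
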


In [18]:
cancion = []
prev = ("","")
for _ in range(100):
    next_word = random.choice(data[prev])
    cancion.append(next_word)
    prev = (prev[-1],next_word)
    
display(Markdown("_" + " ".join(cancion) + "_") )

_look at that cat you'd think that i could hear someone saying as though he was an impossible case no-one ever could reach me but i have been waiting for you here like and old fashioned hero you stand before me you please me , baby so why don't you realize i may what about all those men ? ( your smile and the pain i chose to hide just walk away renee you won't have me tonight in the mirror when i just know it's true oh lord i'm blue i'm cryin ' over you cryin ' over you_

We can also combine songs by several artists to make something even crazier.

In [19]:
import csv

data = {}

authors =  ["ABBA","Stevie Wonder","Hank Williams"]

with open("songdata.csv") as file:
    
    prev = ("","")
    for author,_,_,lyric in csv.reader(file):
    
        if author not in authors:
            continue
    
        for word in lyric.lower().split():
            new_word = word
            
            if not word[0].isalpha():
                data[prev] = data.get(prev,[])
                data[prev].append(word[0])
            
                prev = (prev[-1],word[0])
            
            while new_word and not new_word[0].isalpha():
                new_word = new_word[1:]
            
            while new_word and not new_word[-1].isalpha():
                new_word = new_word[:-1]
                      
            data[prev] = data.get(prev,[])
            data[prev].append(new_word)
            
            prev = (prev[-1],new_word)
            
            if not word[-1].isalpha():
                data[prev] = data.get(prev,[])
                data[prev].append(word[-1])
            
                prev = (prev[-1],word[-1])
                
        data[prev] = data.get(prev,[])
        data[prev].append("")
        prev = (prev[-1],"")


In [20]:
cancion = []
prev = ("","")
for _ in range(100):
    next_word = random.choice(data[prev])
    cancion.append(next_word)
    prev = (prev[-1],next_word)
    
display(Markdown("_" + " ".join(cancion) + "_") )

_look at how they scare me when i ( c ) morning , noon and every night one of these days i'm gonna do my train , i'm leaving now i'm moanin ' moa-oanin ' the blues sometimes life is done should you be coming back soon you don't care what i say it's okay ' cause i know it's time in your soul is burdened down and die you and hold you tight , ' cause tonight i'm leavin ' on there's not a soul out there is fire in his hand on my way to italy from the_

The magic here is that the smaller the distance between 2 authors, the better their lyrics should blend.
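
One way to act on this would be to choose which authors to mix by ranking them by distance to a given one. The sketch below uses made-up word counts and a hypothetical `distancia_sketch` (sum of squared count differences) in place of `calcular_distancia`, whose implementation is not shown here. `closest_authors` is also a hypothetical helper.

```python
def distancia_sketch(a, b, p=2):
    """Hypothetical metric: sum of |a[w] - b[w]|**p over all words in either dict."""
    return sum(abs(a.get(w, 0) - b.get(w, 0)) ** p for w in set(a) | set(b))

def closest_authors(data, target, k=2):
    """Return the k authors whose word counts are nearest to those of `target`."""
    others = [name for name in data if name != target]
    others.sort(key=lambda name: distancia_sketch(data[target], data[name]))
    return others[:k]

# Made-up word counts, for illustration only
toy = {
    "A": {"love": 5, "night": 3},
    "B": {"love": 4, "night": 3, "dance": 1},
    "C": {"train": 6, "blues": 5},
}
print(closest_authors(toy, "A", k=1))  # ['B']
```

The resulting names could then be used as the `authors` list when building the dictionary of following words.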