# The Beatles Machine Learning Data set

Data for training a machine learning model that predicts which users will listen to The Beatles, based on the other artists they listen to. There are two data sets of importance:

* [file_out_2495.csv](../answers/file_out_2495.csv): a list of users who listened to at least one of the 300 most-played artists. The columns are the play counts for each of those artists. The target is "Likes the Beatles".
* file_out_2495_tags.csv: the same as above, plus counts for the genre distribution. The next section shows how to get this file from cloud storage.


## 1. How to get the data from cloud storage


```python
# First import the client library
from google.cloud import storage

# The file path is:
# gs://amazing-public-data/beatles/file_out_2495_tags.csv
# Split it into bucket and object name, and choose where to save the file
bucket_name = "amazing-public-data"
blob_name = "beatles/file_out_2495_tags.csv"
destination_file_name = "../answers/file_out_2495_tags.csv"

# Start a Cloud Storage client
storage_client = storage.Client()

# Open a handle to the bucket
bucket = storage_client.bucket(bucket_name)
# Reference the object (the file) inside the bucket
blob = bucket.blob(blob_name)
# Write the file to disk for later use
blob.download_to_filename(destination_file_name)
```
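The comments in the block above split the `gs://` path into a bucket name and an object name by hand. A minimal sketch of doing the same split in code, using only the standard library; `split_gcs_uri` is a hypothetical helper name introduced here, not part of `google-cloud-storage`:

```python
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split a gs://bucket/path/to/object URI into (bucket_name, blob_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # The object name is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, blob_name = split_gcs_uri(
    "gs://amazing-public-data/beatles/file_out_2495_tags.csv"
)
print(bucket_name)  # amazing-public-data
print(blob_name)    # beatles/file_out_2495_tags.csv
```

The two values can be passed to `storage_client.bucket(...)` and `bucket.blob(...)` exactly as before. If the bucket is publicly readable (an assumption based on its name), `storage.Client.create_anonymous_client()` may let you download without configuring credentials.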

## 2. Analyze the source of the data

The data comes from BigQuery public data sources. This section shows how to query that data and look at it in this Jupyter notebook.

One way to query the data is with the `%%bigquery` cell magic, like so:

```python
%%bigquery artists_df
SELECT      artist_name, COUNT(*)
    FROM     `listenbrainz.listenbrainz.listen`
    GROUP BY artist_name
    ORDER BY COUNT(*) DESC
    LIMIT   300;

```

This returns the 300 most-listened-to artists.
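Outside a notebook cell, the same query can be run through the `google-cloud-bigquery` client instead of the magic. The following is a minimal sketch, assuming the package is installed and default credentials are configured. `build_top_artists_sql` and `top_artists_dataframe` are hypothetical helper names, and the `listen_count` alias is added here for readability; it is not in the original query.

```python
def build_top_artists_sql(limit: int = 300) -> str:
    """Return the top-artists query for a given row limit."""
    limit = int(limit)  # only an integer is ever interpolated into the SQL
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT artist_name, COUNT(*) AS listen_count\n"
        "FROM `listenbrainz.listenbrainz.listen`\n"
        "GROUP BY artist_name\n"
        "ORDER BY listen_count DESC\n"
        f"LIMIT {limit}"
    )


def top_artists_dataframe(limit: int = 300):
    """Run the query and return a pandas DataFrame (needs credentials and network)."""
    # Imported lazily so the SQL helper above works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    return client.query(build_top_artists_sql(limit)).to_dataframe()


# Inspect the generated SQL without contacting BigQuery.
print(build_top_artists_sql(10))
```

Calling `top_artists_dataframe()` should give a frame comparable to `artists_df` from the magic cell, with the count column named `listen_count`.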

> Note: if you get the error "The default BigQuery Storage API client cannot be used, install the missing google-cloud-bigquery-storage and pyarrow packages to use it.", you can install the packages like so:

```
!pip install google-cloud-bigquery-storage pyarrow
```



## 3. Looking at the data

The name after `%%bigquery` is the name of the DataFrame that receives the results. It is a pandas DataFrame; you can read more about pandas at https://pandas.pydata.org/

If you type the name of the DataFrame in a Jupyter cell and run it, you will see some of its rows (your output may differ because this data is updated constantly, and the execution numbers to the left will vary):

![Pandas Output](./img/panda_example.png)
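Because the query does not name its `COUNT(*)` column, the count column in `artists_df` may arrive with an auto-generated name. A small sketch of tidying and inspecting the frame, using made-up stand-in rows; the artist names, counts, and the `f0_` column name are all assumptions for illustration:

```python
import pandas as pd

# Stand-in for artists_df with made-up rows; the real frame comes from the query.
artists_df = pd.DataFrame(
    {"artist_name": ["Artist A", "Artist B", "Artist C"], "f0_": [30, 20, 10]}
)

# Give the unnamed COUNT(*) column a descriptive name, by position.
artists_df.columns = ["artist_name", "listen_count"]

print(artists_df.shape)    # (number of rows, number of columns)
print(artists_df.head(2))  # first two rows only
```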




## 4. Exploring the input data set

We combined the data set above with last.fm's tags for each artist. The overlap of tags is informative about which other artists a listener is likely to enjoy.

```python
import pandas as pd
df_labelled = pd.read_csv("../answers/file_out_2495_tags.csv")
df_labelled[['user_name', 'tag_60s', 'Like_The_Beatles']]
```

![Pandas Output](./img/panda_example2.png)
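Two quick checks are often useful at this point: how balanced the target is, and whether a tag count differs between the two groups. The sketch below uses made-up stand-in rows with the same three columns; it assumes `Like_The_Beatles` is encoded as 0/1, which you should confirm against the real file.

```python
import pandas as pd

# Made-up stand-in rows with the same three columns selected above.
df_example = pd.DataFrame(
    {
        "user_name": ["u1", "u2", "u3", "u4"],
        "tag_60s": [12, 0, 7, 1],
        "Like_The_Beatles": [1, 0, 1, 0],  # assumed 0/1 encoding
    }
)

# Share of each target class; a heavy skew matters when judging a model.
balance = df_example["Like_The_Beatles"].value_counts(normalize=True)
print(balance)

# Mean 60s-tag count within each target class.
mean_60s = df_example.groupby("Like_The_Beatles")["tag_60s"].mean()
print(mean_60s)
```

Running the same two statements on `df_labelled` gives the real class balance and per-class means.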



## EDA with help from some friends

### Sweetviz


Suggested packages to install:

```
!pip install sweetviz ydata-profiling
```



```python

import sweetviz as sv

train = df_labelled.sample(frac=0.8, random_state=200)
test = df_labelled.drop(train.index)
comparison = sv.compare([train, "Train"], [test, "Test"], target_feat="Like_The_Beatles", pairwise_analysis='off')

comparison.show_notebook(layout="vertical", w=800, h=700, scale=0.8)
```

![Sweet Viz](./img/sweetviz.png)
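The `sample` / `drop` pair above is what produces the train and test frames being compared. A small sketch on a made-up ten-row frame shows that the two parts do not overlap and together cover every row:

```python
import pandas as pd

# Made-up stand-in for df_labelled with ten rows.
df_example = pd.DataFrame({"Like_The_Beatles": [0, 1] * 5})

# Same split as above: sample 80% for training, keep the rest for testing.
train = df_example.sample(frac=0.8, random_state=200)
test = df_example.drop(train.index)

print(len(train), len(test))
# The two parts share no rows and together cover the whole frame.
print(train.index.intersection(test.index).empty)
print(len(train) + len(test) == len(df_example))
```

Note that `drop(train.index)` relies on the index labels being unique, which holds for the default index that `pd.read_csv` creates.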


### ydata-profiling (formerly pandas-profiling)

To generate a profile report:

```python
from ydata_profiling import ProfileReport
prof = ProfileReport(df_labelled, correlations=None, minimal=True)
prof
```

![Pandas Profile](./img/panda_profile.png)

