Your name here:

# Assignment

* Students are expected to obtain, clean, and create embeddings for a corpus of their choice.

* Some sample ideas are provided.

* Models would be nice to have, but not required.

* Time permitting, create additional visualizations and draw your own conclusions from them.

* The key part is to show your own thinking when working with the data/corpus.

## Submit:

Submit a single notebook (Jupyter `.ipynb` extension) with your work.
It should contain all the steps that you took to obtain the data, clean it, and/or create embeddings and visualizations.

* [Google Form](https://forms.gle/XJjYPHuDiPjiv6XZ8)

Reminder: to qualify for ECTS credit points, you have to submit at least two good assignments out of five in total.

You can submit even if you don't qualify for ECTS credit points. I will look at all submissions.



## Some possible data sources

* [Project Gutenberg](https://www.gutenberg.org/) - large collection of public domain books
* [Kaggle](https://www.kaggle.com/datasets) - large collection of datasets
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) - large collection of datasets
* [Google Dataset Search](https://toolbox.google.com/datasetsearch) - large collection of datasets
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets) - curated list of datasets
* [Data is Plural](https://tinyletter.com/data-is-plural) - weekly newsletter of interesting datasets
* [Data.world](https://data.world/) - large collection of datasets
* [European Data Portal](https://www.europeandataportal.eu/en) - large collection of datasets
* [CLARIN](https://www.clarin.eu/content/data) - large collection of datasets

Suggestion: take either a small data source or a subset of a large one. Try to keep your corpus under 20 MB for this exercise.
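One way to stay under the size limit is to keep only the leading documents that fit a byte budget. The sketch below is illustrative only: `take_subset` and the toy `corpus` are hypothetical names, and it assumes the corpus is already loaded as a list of strings.

```python
def take_subset(documents, max_bytes=20 * 1024 * 1024):
    """Return the leading documents whose combined UTF-8 size fits in max_bytes."""
    subset = []
    total = 0
    for doc in documents:
        size = len(doc.encode("utf-8"))  # size on disk when saved as UTF-8
        if total + size > max_bytes:
            break
        subset.append(doc)
        total += size
    return subset, total


# Hypothetical toy corpus; replace it with your own list of texts.
corpus = ["Tere, maailm!", "Teine lõik.", "Kolmas ja pikem lõik teksti."]
subset, used = take_subset(corpus, max_bytes=30)
print(len(subset), "documents,", used, "bytes")
```

Taking the first documents is the simplest choice. If the source is ordered (for example by date or category), a random sample may represent the corpus better.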


## Minimum requirements: obtain, clean, and create embeddings for a corpus of your choice

Creating embeddings means building some sort of numerical representation of your text data. This could be a simple bag of words, or something more complex such as TF-IDF, word2vec, or doc2vec. You can use any library you want to create and to visualize the embeddings.
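To make the two simplest options concrete, here is a minimal sketch of bag of words and TF-IDF written from scratch with the standard library. The helper names (`tokenize`, `bag_of_words`, `tf_idf`) and the toy documents are made up for illustration. The TF-IDF shown is the plain textbook variant, count × log(N / document frequency). Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a very basic cleaning step)."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, count vectors) with one count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by log(N / document frequency) of each term."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    doc_freq = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    return [
        [v[j] * math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
        for v in vectors
    ]


# Hypothetical toy documents.
docs = ["the cat sat", "the dog sat down", "the cat chased the dog"]
vocab, bow = bag_of_words(docs)
weighted = tf_idf(bow)
print(vocab)
print(bow[2])
print([round(x, 2) for x in weighted[2]])
```

A word that occurs in every document (here `the`) gets a TF-IDF weight of zero, which is why TF-IDF is often more informative than raw counts.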

## Present/visualize the top 5 words for each document - up to you how you define "top"

Create a histogram of the top words over the whole corpus, limited to 50 words. You can use any library you want to compute and plot the histogram.
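As a starting point, the sketch below defines "top" as raw frequency, which is only one possible choice, and prints a text-only histogram instead of a plot. The function names and the toy documents are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def top_words(text, n=5):
    """Return the n most frequent words of a single document."""
    return Counter(tokenize(text)).most_common(n)


def corpus_top_words(docs, n=50):
    """Return the n most frequent words over all documents."""
    total = Counter()
    for doc in docs:
        total.update(tokenize(doc))
    return total.most_common(n)


# Hypothetical toy documents.
docs = ["the cat sat", "the dog sat down", "the cat chased the dog"]

for i, doc in enumerate(docs):
    print(i, top_words(doc))

# Text-only histogram of the corpus-wide counts.
for word, count in corpus_top_words(docs):
    print(f"{word:<10} {'#' * count}")
```

With raw counts, function words such as "the" tend to dominate, so removing stop words or ranking by TF-IDF weight may give more interesting results. The `(word, count)` pairs can be passed to any plotting library for a proper bar chart.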



## Ideas for visualizations

* [Pandas built-in visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
* [Plotly](https://plot.ly/python/)
* [seaborn](https://seaborn.pydata.org/)

* [Word Cloud](https://www.datacamp.com/community/tutorials/wordcloud-python)
* [Scattertext](https://github.com/JasonKessler/scattertext#visualizing-phrase-associations)


## Conclusion

I want to see your own ideas and conclusions. Even if you do not write much Python code, I want to see your thought process on corpus selection, cleaning, and further analysis.

## Submission

* Submit your notebook using the Google Form, which will be available on the day of the class.

In [1]:
# prompt: Import the needed Python libraries, download data from the URL "http://peeter.eki.ee:5000/valence/exportparagraphs", and save the result as a CSV file. Then load the CSV file into a dataframe.

import pandas as pd
import requests

# Download data from the URL
url = "http://peeter.eki.ee:5000/valence/exportparagraphs"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here on an HTTP error instead of saving an error page

# Save the result as a CSV file
with open("valence_data.csv", "wb") as f:
    f.write(response.content)

# Load the CSV file into a dataframe
df = pd.read_csv("valence_data.csv")

# head
df.head()


Unnamed: 0,category,sourceurl,nr,valence,paragraph
0,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,1.0,negatiivne,Enam kui kümme aastat tagasi tegutses huumoris...
1,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,2.0,vastuoluline,Neid ridu kirjutades tundub isegi ebaviisakas ...
2,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,3.0,positiivne,Isiklikult kohtasin natukegi Kukekese moodi po...
3,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,4.0,vastuoluline,"Olen näinud ka, kuidas patrull korrarikkujat t..."
4,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,5.0,negatiivne,Kummaline on nüüd äkki lugeda politsei ja sise...


In [2]:
# prompt: Using dataframe df: create a bag-of-words embedding for the paragraph column and save it into an embedding column in the dataframe

from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the 'paragraph' column
embeddings = vectorizer.fit_transform(df['paragraph'])

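# Create a new column 'embedding' in the DataFrame.
# Note: toarray() converts the sparse matrix into a dense one, which can use
# a lot of memory when the vocabulary is large.
df['embedding'] = list(embeddings.toarray())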

# head
df.head()


Unnamed: 0,category,sourceurl,nr,valence,paragraph,embedding
0,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,1.0,negatiivne,Enam kui kümme aastat tagasi tegutses huumoris...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,2.0,vastuoluline,Neid ridu kirjutades tundub isegi ebaviisakas ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,3.0,positiivne,Isiklikult kohtasin natukegi Kukekese moodi po...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,4.0,vastuoluline,"Olen näinud ka, kuidas patrull korrarikkujat t...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,ARVAMUS,http://arvamus.postimees.ee/1001520/anvar-samo...,5.0,negatiivne,Kummaline on nüüd äkki lugeda politsei ja sise...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [3]:
# prompt: Using dataframe df: show minimum and maximum number of elements in embedding column cells

# Calculate the number of elements in each embedding
df['embedding_length'] = df['embedding'].apply(len)

# Find the minimum and maximum number of elements
min_length = df['embedding_length'].min()
max_length = df['embedding_length'].max()

# Print the results
print("Minimum number of elements in embedding:", min_length)
print("Maximum number of elements in embedding:", max_length)


Minimum number of elements in embedding: 33856
Maximum number of elements in embedding: 33856


In [5]:
# shape of df
df.shape

(4088, 7)
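The row count and the embedding length reported above give a rough idea of how much memory the dense embeddings need. The calculation below is a back-of-the-envelope sketch. The 8 bytes per value is an assumption (64-bit integer counts), and the overhead of storing each row as a separate object in a DataFrame cell is ignored.

```python
n_rows = 4088          # number of paragraphs, from df.shape
n_cols = 33856         # vocabulary size, i.e. the length of every embedding
bytes_per_value = 8    # assumption: counts stored as 64-bit integers

n_values = n_rows * n_cols
dense_bytes = n_values * bytes_per_value
print(f"{n_values:,} values, about {dense_bytes / 1024**3:.2f} GiB as a dense array")
```

That is roughly 1 GiB for a corpus of only a few thousand paragraphs, and almost all of those values are zeros. Keeping the sparse matrix returned by `fit_transform` and converting to dense only when needed is usually much cheaper.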

In [4]:
# prompt: Find what values are present in embedding column of df

# Get all unique values in the 'embedding' column
all_values = set()
for embedding in df['embedding']:
    # Each cell holds one NumPy row; tolist() yields plain Python ints
    all_values.update(embedding.tolist())

# Print the unique values
print("Unique values in 'embedding' column:", all_values)


Unique values in 'embedding' column: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}

