- To use Word2Vec for predicting essay scores, we create word embeddings with the gensim library.
- The input is the lemmatized text with stop words, punctuation, and special characters removed. This preprocessing is typically beneficial because Word2Vec learns the semantic meaning of words and the relationships between them from the words that surround them.
- It is generally recommended to convert text to lowercase before training Word2Vec. This ensures that words like "Apple" and "apple" are treated as the same word, which reduces the vocabulary size and improves the model's ability to learn meaningful representations.
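
As a small illustration of the lowercasing point, the sketch below counts distinct tokens before and after lowercasing. The sample texts and the helper name `vocabulary` are made up for the example and are not taken from the essay dataset.

```python
from collections import Counter

# Hypothetical sample texts, invented for illustration only
texts = ["Apple pie recipe", "apple tree Garden", "garden apple"]

def vocabulary(texts, lowercase):
    """Count whitespace-separated tokens, optionally lowercasing first."""
    counts = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counts.update(text.split())
    return counts

raw_vocab = vocabulary(texts, lowercase=False)
lower_vocab = vocabulary(texts, lowercase=True)

# Compare the number of distinct tokens with and without lowercasing
print(len(raw_vocab), len(lower_vocab))
# After lowercasing, "Apple" and "apple" share a single count
print(lower_vocab["apple"])
```

Without lowercasing, the capitalized and lowercase forms each get their own vocabulary entry and would each be trained on fewer examples.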

### Content
  - [1. Get Data](#1-Get-Data)
  - [2. Analyze Text Length]()
  - [3. Conclusion (explanation of parameters)]()
  - [4. Train the Word2Vec Model]()

### 1. Get Data

In [65]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('transformed_data_v1.csv')
df_numerical = pd.read_csv('numeric_features_added_v1.csv')

### 2. Analyze Text Length

In [67]:
# Choose the text column (lemmatized text without stop words, punctuation, and special characters)
text_column = 'clean_lemm_preprocessed_text'

# Convert text to lowercase
df[text_column] = df[text_column].str.lower()

# Analyze text lengths
df['text_length'] = df[text_column].apply(lambda x: len(x.split()))

# Calculate minimum, maximum, and average length
min_length = df['text_length'].min()
max_length = df['text_length'].max()
average_length = df['text_length'].mean()

print(f'Minimum text length: {min_length}')
print(f'Maximum text length: {max_length}')
print(f'Average text length: {average_length}')


Minimum text length: 57
Maximum text length: 840
Average text length: 183.58469984829878


In [68]:
from collections import Counter

# Summarize the average sentence length per essay
print("\n" + "-"*50 + "\n")
print(df_numerical['avg_sentence_length'].describe())
print("\n" + "-"*50 + "\n")

# Tokenize the sentences into words
sentences = df['clean_lemm_preprocessed_text'].apply(lambda x: x.split())

# Calculate vocabulary size
vocab = Counter(word for sentence in sentences for word in sentence)
vocab_size = len(vocab)
print(f'Vocabulary size: {vocab_size}')


--------------------------------------------------

count    13843.000000
mean        20.394512
std         13.308014
min          5.354167
25%         15.800000
50%         18.578947
75%         22.200000
max        715.000000
Name: avg_sentence_length, dtype: float64

--------------------------------------------------

Vocabulary size: 54074


### 3. Conclusion (explanation of parameters)

Based on the dataset statistics above, these are the parameters chosen for the Word2Vec model and the reasoning behind each:
- **vector_size=300:** Larger than gensim's default of 100; the extra dimensions help capture more semantic nuance across a vocabulary of about 54,000 distinct words.
- **window=5:** Up to 5 words on either side of the target word are used as context, which fits comfortably within the average sentence length of approximately 20 words. Each essay is passed to the model as a single token list, so a window near a sentence boundary can reach into the neighboring sentence.
- **min_count=10:** Ignores words that occur fewer than 10 times in the corpus, reducing noise and focusing on more common and likely more informative words.
- **workers=32:** Trains with 32 worker threads, one for each of the 32 available CPU cores, to speed up training.
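
One way to sanity-check `min_count` before training is to measure how many distinct words, and what share of all tokens, survive the threshold. The sketch below does this with a plain `Counter`. The helper name `min_count_coverage` and the toy corpus are assumptions made for illustration, and the threshold is lowered to 2 so the tiny example has something to keep.

```python
from collections import Counter

def min_count_coverage(tokenized_texts, min_count):
    """Return (kept_vocab_size, total_vocab_size, share_of_tokens_kept)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Keep only words whose corpus frequency reaches the threshold
    kept = {word: c for word, c in counts.items() if c >= min_count}
    total_tokens = sum(counts.values())
    kept_tokens = sum(kept.values())
    share = kept_tokens / total_tokens if total_tokens else 0.0
    return len(kept), len(counts), share

# Hypothetical toy corpus: each inner list stands for one tokenized essay
toy = [
    ["student", "write", "essay"],
    ["student", "read", "essay"],
    ["teacher", "grade", "essay"],
]

kept_vocab, total_vocab, token_share = min_count_coverage(toy, min_count=2)
print(kept_vocab, total_vocab, round(token_share, 2))
```

Run on the tokenized essays with `min_count=10`, the same function would indicate how much of the 54,074-word vocabulary the model actually keeps and how many tokens fall outside it.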

### 4. Train the Word2Vec Model

In [72]:
import pandas as pd
import gensim
from gensim.models import Word2Vec
import numpy as np

# Load the dataset
df = pd.read_csv('transformed_data_v1.csv')

# Choose the text column (lemmatized text without stop words)
text_column = 'clean_lemm_preprocessed_text'

# Convert text to lowercase
df[text_column] = df[text_column].str.lower()

# Prepare the data for Word2Vec
# Tokenize the sentences into words
sentences = df[text_column].apply(lambda x: x.split())

# Train the Word2Vec model with the parameters chosen in section 3
w2v_model = Word2Vec(sentences, vector_size=300, window=5, min_count=10, workers=32)

# Average the vectors of all in-vocabulary words in a text (zero vector if none are found)
def average_word_vectors(sentence, model, vector_size):
    words = sentence.split()
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    if len(word_vectors) == 0:
        return np.zeros(vector_size)
    return np.mean(word_vectors, axis=0)

# Apply the function to the text column to get one embedding per essay
vector_size = w2v_model.vector_size
df['word2vec_embedding'] = df[text_column].apply(lambda x: average_word_vectors(x, w2v_model, vector_size))

# Check the result
print(df[['word2vec_embedding']].head())

# Save the updated DataFrame
df.to_csv('word2vec_features.csv', index=False)


                                  word2vec_embedding
0  [-0.09748275, 0.005645593, 0.33820832, 0.30875...
1  [0.0027511492, 0.08209037, -0.15495926, 0.0202...
2  [0.044071138, -0.09524269, 0.19774339, -0.2092...
3  [0.0908097, -0.0639152, 0.35811678, 0.23761816...
4  [0.034144007, -0.29616755, 0.12020076, -0.2252...
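
Because `to_csv` writes each NumPy array in `word2vec_embedding` as its printed string, the vectors come back as text when the file is reloaded. A possible alternative, sketched below under the assumption that one numeric column per dimension is acceptable downstream, expands the embeddings into separate columns before saving. The `w2v_` prefix, the `essay_id` column, and the toy 3-dimensional vectors are made up for the example.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame: two rows with 3-dimensional embeddings
toy_df = pd.DataFrame({
    "essay_id": [1, 2],  # hypothetical identifier column
    "word2vec_embedding": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

# Stack the per-row arrays into a 2-D matrix, one column per dimension
matrix = np.vstack(toy_df["word2vec_embedding"].tolist())
embedding_cols = pd.DataFrame(matrix, index=toy_df.index).add_prefix("w2v_")

# Replace the array column with the expanded numeric columns
flat_df = pd.concat(
    [toy_df.drop(columns="word2vec_embedding"), embedding_cols], axis=1
)

# Round-trip through CSV (an in-memory buffer stands in for a file here)
buffer = io.StringIO()
flat_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
print(reloaded.dtypes)
```

With the trained model this would produce 300 columns, `w2v_0` through `w2v_299`, which can be fed to a regression model directly after `read_csv`.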
