In [28]:
import pandas as pd
import os

print('Get current working directory : ', os.getcwd())

#df = pd.read_csv('/Users/sima.sharifirad/career_designer_sima/Occupation Data 2023.csv')
df = pd.read_csv('Occupation Data 2023.csv')

Get current working directory :  /Users/aokoln/occupation-project


In [29]:
df.head(3)

Unnamed: 0,YEAR,CLASSID,CLASSID_NAME,AREAID,AREAID_NAME,OCCID,OCCID_NAME,EARN_AVG
0,2031,4,Extended Proprietors,1125010702,,51-7021,Furniture Finishers,0.0
1,2011,4,Extended Proprietors,17031470100,,43-4111,"Interviewers, Except Eligibility and Loan",31.011969
2,2016,3,Self-Employed,9009150300,,31-2022,Physical Therapist Aides,17.508876


### Task 1:

Step 1: Import the required libraries.

Step 2: Preprocess the text data and split the data into training and testing sets.

Step 3: Vectorize the 'OCCID_NAME' text data using TF-IDF.

Step 4: Train an SVM classifier on the TF-IDF vectors.

Step 5: Evaluate the model on the testing set.

In [30]:
# Step 1: Import the required libraries.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [31]:
# Step 2: Preprocess the text data and split the data into training and testing sets.

# Step 2a: Load the DataFrame and split the data into features and labels.
# Assuming you already have a DataFrame named 'df' with columns 'OCCID_NAME' and 'CLASSID_NAME'

X = df['OCCID_NAME']
y = df['CLASSID_NAME']

#### random_state:

The random_state parameter is used for reproducibility. 
Utilities like *train_test_split* involve random processes,
such as shuffling the data before it is split into training and testing sets. 

Setting random_state to a fixed integer ensures that the random processes are performed in a predictable manner, 
and the same results can be reproduced each time the code is executed with the same value of random_state.

This is helpful for debugging and obtaining consistent results in experiments.
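As a minimal sketch of this behaviour, the snippet below splits a small list twice with the same seed and once with a different seed. The list `items` and the seed `7` are made up for illustration and are not part of the occupation dataset.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data, unrelated to the occupation dataset.
items = list(range(10))

# The same random_state gives the same shuffle, so both calls return the same split.
train_a, test_a = train_test_split(items, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(items, test_size=0.2, random_state=42)
print(test_a == test_b)  # True

# A different seed shuffles differently, so the split will usually differ.
train_c, test_c = train_test_split(items, test_size=0.2, random_state=7)
print(test_c)
```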

In [32]:
# Step 2b: Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### TfidfVectorizer:

TfidfVectorizer is a class from the *scikit-learn* library that is used for converting text data into
a numerical format called **Term Frequency-Inverse Document Frequency (TF-IDF)** vectors. 
It calculates the *TF-IDF* value for each word (term) in the text data, 
which is a measure of how important a word is in a particular document relative to the entire corpus of documents. 

*TF-IDF* helps to represent the text data in a way that emphasizes important words 
while downplaying common and uninformative words.


#### fit_transform:

*fit_transform* is a method in scikit-learn's vectorizers (like *TfidfVectorizer*) that combines two steps: 
1. fitting the vectorizer to the training data and 
2. transforming the training data into TF-IDF vectors. 

In the code, *fit_transform* is used on the training data *X_train* to learn 
the vocabulary and IDF statistics from the training data and 
then transform it into TF-IDF vectors represented as *X_train_tfidf*. 

The same vocabulary and IDF statistics learned from the training data are later used 
to transform the testing data into TF-IDF vectors using the transform method.


#### tfidf_vectorizer:

*tfidf_vectorizer* is an instance of the *TfidfVectorizer* class. 

It is created with specific parameters like *max_features* and *stop_words* to customize the vectorization process.

Once *tfidf_vectorizer* is created, it can be used to transform the text data (both training and testing) 
into TF-IDF vectors using *fit_transform* and *transform* methods, respectively.

In [37]:
# Step 3: Vectorize the 'OCCID_NAME' text data using TF-IDF

tfidf_vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
# You can adjust max_features as needed

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

X_test_tfidf = tfidf_vectorizer.transform(X_test)


###### Having an issue with understanding data types and 
###### previewing what the data looks like in this code block:

#X_train_tfidf.head(3)  # fails: X_train_tfidf is a SciPy sparse matrix, not a DataFrame, so it has no .head()
type(X_train_tfidf)
X_train_tfidf

<80x198 sparse matrix of type '<class 'numpy.float64'>'
	with 278 stored elements in Compressed Sparse Row format>

#### TF-IDF an example:

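We created a *TfidfVectorizer* instance without specifying any parameters. 
This means the vectorizer uses the default settings, including tokenization and lowercase conversion. 
Note that the defaults do not remove English stop words (that requires *stop_words='english'*), and the default tokenizer drops single-character tokens such as "a".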

We used *fit_transform* to transform the documents into TF-IDF vectors represented as *tfidf_matrix*.

We obtained the feature names (words) learned by the vectorizer using *get_feature_names_out()*.

We printed the *TF-IDF scores* for each term in each document.
The TF-IDF score represents the importance of a term (word) in a specific document. 
Higher values indicate higher importance relative to the entire corpus.

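By analyzing the TF-IDF scores, you can observe the following:

- Words that appear in every document, like "fox" and "dog," get the lowest IDF weight, because they are less informative for distinguishing between documents. "The" appears in two of the three documents, so its IDF is also below the maximum. Because stop words are not removed here and "the" occurs twice in documents 1 and 3, its TF-IDF score in those documents is still high.

- Words that appear in only one document, like "jumps," "brown," "good," and "friends," get the highest IDF weight, because they are more specific to individual documents and provide more discriminative power. "Lazy" and "quick" appear in two documents and fall in between.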

This simple analysis helps in understanding how TF-IDF works to 
represent text data as numerical vectors and highlights the importance of individual terms in different documents. 

tfidf link: <https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus of documents
documents = [
    "The quick brown fox jumps over the lazy dog",
    "A fox and a dog are good friends",
    "The dog is lazy but the fox is quick"
]

# Step 1: Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Step 2: Transform the documents into TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get the feature names (words) learned by the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Step 3: Print the TF-IDF scores for each term in each document
for doc_idx, doc in enumerate(documents):
    print(f"Document {doc_idx + 1}:")
    for term_idx, term in enumerate(feature_names):
        tfidf_score = tfidf_matrix[doc_idx, term_idx]
        if tfidf_score > 0:
            print(f"{term}: {tfidf_score:.3f}")
    print("-" * 40)

#### SVM (Support Vector Machine):

SVM is a popular supervised machine learning algorithm used for both classification and regression tasks. 
In the context of the code, we are using SVM for text classification, 
where each document (represented by its TF-IDF vector) is associated with a class label.

The main idea behind SVM is to find a hyperplane that best separates the data into different classes. 

- For *binary classification*, the hyperplane separates the data into two classes by maximizing the margin (distance) between the classes. 

- For *multi-class classification*, like in this example, SVM combines several binary classifiers using techniques such as One-vs-Rest or One-vs-One. scikit-learn's *SVC* uses One-vs-One internally.


#### kernel:

The kernel parameter in SVM determines the *type of function used to transform the input data into 
a higher-dimensional space*, where it becomes easier to find a separating hyperplane. 

Different kernel functions can be used, such as 'linear', 'poly', 'rbf' (Radial Basis Function), 'sigmoid', etc.

In the code, we used the 'linear' kernel, which implies a linear hyperplane for separation. 
This kernel is suitable for cases where the data can be effectively separated by a straight line or plane. 

Other kernel functions can be used depending on the nature of the data and the problem.


#### C:

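The C parameter is the *regularization parameter* in SVM; the strength of the regularization is inversely proportional to C. 
It controls the trade-off between maximizing the margin (favored by a smaller C) and 
minimizing the classification error on the training data (favored by a larger C).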

A smaller value of C allows more misclassifications on the training data, which can lead to a wider margin, 
while a larger C value reduces the margin to correctly classify more training points.

In the code, we set C=1.0, which is the default value in scikit-learn. 

You can experiment with different values of C to find the best trade-off between 
margin and classification accuracy.
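A rough way to run such an experiment is to compare cross-validated accuracy for a few values of C. The sketch below uses synthetic data from `make_classification` and an arbitrary list of C values, both chosen only for illustration; with the real data you would pass the TF-IDF training matrix and labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; not the occupation dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

mean_scores = {}
for c_value in (0.01, 1.0, 10.0):
    model = SVC(kernel="linear", C=c_value)
    # 5-fold cross-validation: accuracy is averaged over five train/validation splits.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[c_value] = scores.mean()
    print(f"C={c_value}: mean CV accuracy = {scores.mean():.3f}")
```

Which value wins depends on the data, so the numbers from this toy run say nothing about the best C for the occupation titles.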


Overall, the SVM classifier with a specified kernel and C value aims to 
find an *optimal hyperplane* that effectively separates the data into different classes based on 
the TF-IDF vectors derived from the text data. 

It is one of the classical and widely used methods for text classification tasks.

In [26]:
# Step 4: Train an SVM classifier on the TF-IDF vectors

svm_classifier = SVC(kernel='linear', C=1.0)  # You can try different kernels and hyperparameters as well
svm_classifier.fit(X_train_tfidf, y_train)

###### Need to change these for the task

SVC(kernel='linear')

In [27]:
# Step 5: Evaluate the model on the testing set

# Predict labels for the test set; y_test holds the ground-truth (golden) labels
y_pred = svm_classifier.predict(X_test_tfidf)

# Calculate accuracy and print classification report
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

classification_rep = classification_report(y_test, y_pred)

print("Classification Report:")
print(classification_rep)

Accuracy: 0.25
Classification Report:
                      precision    recall  f1-score   support

Extended Proprietors       0.00      0.00      0.00         5
  Non-QCEW Employees       0.00      0.00      0.00         5
      QCEW Employees       0.33      0.80      0.47         5
       Self-Employed       0.50      0.20      0.29         5

            accuracy                           0.25        20
           macro avg       0.21      0.25      0.19        20
        weighted avg       0.21      0.25      0.19        20



#### Macro Average:

Macro average calculates the *unweighted average of a metric* (e.g., precision, recall, F1-score) across all classes.

It treats each class equally, regardless of the class's size or number of samples. 
It is useful when you want to assess the model's performance without considering class imbalances. 

Macro average gives equal importance to each class, so it can be useful when all classes are equally important.

#### Weighted Average:

Weighted average, on the other hand, calculates the *average of a metric while considering the class's support* 
(i.e., the number of samples belonging to that class). It gives more weight to classes with more samples, 
which means larger classes have a greater impact on the overall average. 

Weighted average is useful when there are class imbalances, and you want to account for 
the fact that some classes have more impact on the overall performance.
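The two averages can be reproduced by hand from the per-class scores. The sketch below uses a made-up, deliberately imbalanced label set (eight "a" samples and two "b" samples) so that the macro and weighted F1 values differ.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Hypothetical labels: class "a" has 8 samples, class "b" has 2.
y_true = ["a"] * 8 + ["b"] * 2
y_hat = ["a"] * 7 + ["b"] + ["a", "b"]

_, _, f1_per_class, support = precision_recall_fscore_support(
    y_true, y_hat, labels=["a", "b"]
)

# Macro: plain mean of the per-class scores.
macro_f1 = f1_per_class.mean()
# Weighted: mean of the per-class scores weighted by each class's support.
weighted_f1 = np.average(f1_per_class, weights=support)

print(macro_f1, f1_score(y_true, y_hat, average="macro"))        # both 0.6875
print(weighted_f1, f1_score(y_true, y_hat, average="weighted"))  # both 0.8
```

In the report above every class has a support of 5, which is why the macro and weighted rows show the same values.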

#### Support:

The support is simply the *number of samples in each class*. 

It provides information about the distribution of the classes in the test dataset. 
For instance, if some classes have very few samples, 
it may indicate that the model's performance on those classes might be less reliable due to limited data.


When evaluating a multi-class classifier, you typically get classification metrics like 
*precision*, *recall*, and *F1-score* for each class separately. 

Macro average and weighted average provide a way to summarize these class-wise metrics and 
get an overall understanding of the model's performance across all classes.

### Possible problems:

#### Class Imbalance:

If the training data is heavily imbalanced, where one or more classes have very few samples compared to the others,
the model may struggle to learn the characteristics of the minority classes, 
resulting in poor performance on those classes.

#### Insufficient Data:

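If a particular class has very few training samples or is not well-represented in the training data, 
the model may not be able to generalize well for that class during testing. 
In the run above, the model was trained on only 80 samples and evaluated on 20 (5 per class), 
so each per-class score rests on very little data.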

#### Overfitting:

If the model is overfitting to the majority classes, it may ignore the patterns in the minority classes, 
leading to poor performance on those classes.

#### Model Complexity:

The chosen model may not be suitable for the specific dataset, or 
its complexity might not match the data's complexity. 
Different algorithms or model configurations might perform better on the given dataset.

### Solutions:

#### Data Augmentation:

If possible, consider augmenting the data for the underrepresented classes to increase the number of training samples.

#### Class Balancing Techniques:

Use techniques like *oversampling* (e.g., *SMOTE* - Synthetic Minority Over-sampling Technique) or 
*undersampling* to balance the class distribution.
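SMOTE itself lives in the separate *imbalanced-learn* package, so the sketch below swaps in a simpler technique: random oversampling with scikit-learn's `resample`, which duplicates minority-class rows until every class matches the largest one. The frame `train_df` and its rows are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training data: 4 rows of one class, 1 of another.
train_df = pd.DataFrame({
    "OCCID_NAME": [
        "Furniture Finishers", "Physical Therapist Aides",
        "Web Developers", "Registered Nurses", "Cashiers",
    ],
    "CLASSID_NAME": ["QCEW Employees"] * 4 + ["Self-Employed"],
})

majority_size = train_df["CLASSID_NAME"].value_counts().max()

# Sample each class with replacement up to the size of the largest class.
balanced_parts = [
    resample(group, replace=True, n_samples=majority_size, random_state=42)
    for _, group in train_df.groupby("CLASSID_NAME")
]
balanced_df = pd.concat(balanced_parts, ignore_index=True)
print(balanced_df["CLASSID_NAME"].value_counts())
```

Resampling should be applied to the training set only, after the train/test split, so that duplicated rows cannot end up in the test set.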

#### Hyperparameter Tuning:

Experiment with different hyperparameter settings for the model, including *regularization*, *kernel type*, and *complexity*, to find the optimal configuration for better performance on all classes.
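One common way to automate this search is `GridSearchCV`, which tries every combination in a parameter grid with cross-validation. The sketch below is a small illustration only: the texts, the labels ("health", "tech") and the grid values are invented, and a real run would use the occupation titles and a larger grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus with two made-up classes.
texts = [
    "Registered Nurses", "Nursing Assistants",
    "Physical Therapist Aides", "Dental Hygienists",
    "Software Developers", "Web Developers",
    "Database Administrators", "Computer Programmers",
]
labels = ["health"] * 4 + ["tech"] * 4

# Putting the vectorizer inside the pipeline refits it on each training fold.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])
param_grid = {"svm__kernel": ["linear", "rbf"], "svm__C": [0.1, 1.0, 10.0]}

# Macro F1 scores every class equally, which suits the goal of doing well on all classes.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```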

#### Model Selection: 

Try using different algorithms and compare their performance to determine which one is better suited for your dataset.

#### Error Analysis: 

Conduct an error analysis to understand the types of mistakes the model is making for the problematic classes. 
This insight can help identify the root causes and guide improvements.
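A confusion matrix is a simple starting point for error analysis, because it shows which true class is mistaken for which predicted class. The labels below are hypothetical; with the real model you would pass *y_test* and *y_pred*.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
class_names = ["QCEW Employees", "Self-Employed"]
y_true_demo = ["QCEW Employees", "QCEW Employees", "Self-Employed", "Self-Employed"]
y_pred_demo = ["QCEW Employees", "QCEW Employees", "QCEW Employees", "Self-Employed"]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)
matrix_df = pd.DataFrame(
    matrix,
    index=[f"true: {name}" for name in class_names],
    columns=[f"pred: {name}" for name in class_names],
)
print(matrix_df)
```

Here one "Self-Employed" sample is predicted as "QCEW Employees", which shows up as the off-diagonal 1 in the second row.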

# Next step for Anna:

**Random Forest:** Random Forest is an ensemble learning method that uses multiple decision trees to make predictions. It is robust, handles non-linear relationships well, and can handle large datasets effectively.

In [None]:
# Step 1: Import the required libraries.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Step 2: Preprocess the text data and split the data into training and testing sets.

###### We already have the split from earlier

In [None]:
# Step 3: Vectorize the 'OCCID_NAME' text data using TF-IDF

###### We already vectorized above.

In [None]:
# Step 4: Train a Random Forest classifier on the TF-IDF vectors

# n_estimators is the number of trees; random_state makes the run reproducible.
# These values are starting points to tune, not recommendations.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_tfidf, y_train)

In [None]:
# Step 5: Evaluate the model on the testing set

# Predict labels for the test set; y_test holds the ground-truth (golden) labels
y_pred = rf_classifier.predict(X_test_tfidf)

# Calculate accuracy and print classification report
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

classification_rep = classification_report(y_test, y_pred)

print("Classification Report:")
print(classification_rep)

**Decision Trees:** Decision Trees are simple and interpretable classifiers that can handle both numerical and categorical data. However, they may overfit the data if not pruned properly.

**Gradient Boosting:** Gradient Boosting is an ensemble technique that combines multiple weak learners (usually decision trees) to create a powerful classifier. It often performs well and can handle complex relationships.

**K-Nearest Neighbors (KNN):** KNN is a lazy learning algorithm that classifies samples based on the majority class of their k nearest neighbors in the feature space. It is simple and can handle non-linear decision boundaries.

**Ensemble models:** Decision Trees + K-Nearest Neighbors (KNN)

**Naive Bayes:** Naive Bayes is based on Bayes' theorem and assumes that features are conditionally independent given the class label. It is efficient and works well with high-dimensional data.
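As a starting point for trying these alternatives, the sketch below fits four of the classifiers listed above on the same TF-IDF features and prints each one's test accuracy. The texts, the labels ("health", "tech") and the settings such as `n_neighbors=3` are invented for illustration; scores on eight toy examples are not meaningful.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with two made-up classes.
toy_train_texts = [
    "Registered Nurses", "Nursing Assistants",
    "Physical Therapist Aides", "Dental Hygienists",
    "Software Developers", "Web Developers",
    "Database Administrators", "Computer Programmers",
]
toy_train_labels = ["health"] * 4 + ["tech"] * 4
toy_test_texts = ["Physical Therapist Assistants", "Software Testers"]
toy_test_labels = ["health", "tech"]

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": MultinomialNB(),
}

accuracy_by_model = {}
for name, classifier in candidates.items():
    # Each pipeline fits its own TF-IDF vectorizer on the training texts only.
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(toy_train_texts, toy_train_labels)
    accuracy_by_model[name] = model.score(toy_test_texts, toy_test_labels)
    print(f"{name}: {accuracy_by_model[name]:.2f}")
```

With the real data, the same loop can reuse *X_train*, *X_test*, *y_train* and *y_test* from the earlier split, and `classification_report` gives a per-class view that accuracy alone hides.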