In [1]:
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

In [2]:
startups_df = pd.read_excel('startups.xlsx', sheet_name='Request 2')

In [3]:
startups_df.head(5)

Unnamed: 0,ID,Company Name,Date,Description,Industry,Industry2
0,1,Enclarity Inc,2005-01-01,"Enclarity, Inc. is a United States-based healt...",Information Technology,Computer Software and Services
1,2,Ocean Entertainment Inc,2014-01-16,Ocean Entertainment Inc. is introducing the fi...,Non-High Technology,Consumer Related
2,3,Ocean Entertainment Inc,2014-01-16,Ocean Entertainment Inc. is introducing the fi...,Non-High Technology,Consumer Related
3,4,Hengyang Jinzeli Special Allop Co Ltd,1999-12-01,"Hengyang Jinzeli Special Alloy Co., Ltd. is a ...",Non-High Technology,Industrial/Energy
4,5,Verge Solutions LLC,2001-01-01,Verge Solutions LLC is a United States-based c...,Information Technology,Computer Software and Services
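
Rows 1 and 2 of the preview above carry the same company, date, and description under different IDs. If repeats like this are common in the full sheet, identical texts can land on both sides of a cross-validation split and inflate the score. The sketch below shows one way to check for and drop them. It uses a small stand-in frame built from the truncated preview rows, so the counts say nothing about the real data, and the choice of key columns is an assumption.

```python
import pandas as pd

# Stand-in for startups_df: a few rows shaped like the preview above.
preview_df = pd.DataFrame({
    'ID': [1, 2, 3, 5],
    'Company Name': ['Enclarity Inc', 'Ocean Entertainment Inc',
                     'Ocean Entertainment Inc', 'Verge Solutions LLC'],
    'Description': ['Enclarity, Inc. is a United States-based healt...',
                    'Ocean Entertainment Inc. is introducing the fi...',
                    'Ocean Entertainment Inc. is introducing the fi...',
                    'Verge Solutions LLC is a United States-based c...'],
})

# Count rows that repeat an earlier (company, description) pair.
key_cols = ['Company Name', 'Description']
n_repeats = int(preview_df.duplicated(subset=key_cols).sum())
print(f'Repeated rows: {n_repeats}')

# Keep only the first occurrence of each pair before vectorizing.
deduped_df = preview_df.drop_duplicates(subset=key_cols).reset_index(drop=True)
print(deduped_df.shape)
```

On the real data the same two calls would run on `startups_df` before the TF-IDF step.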


In [4]:
# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# The descriptions are in the 'Description' column
tfidf_matrix = vectorizer.fit_transform(startups_df['Description'])

In [5]:
print(tfidf_matrix.shape)

(60089, 127542)
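
The shape reads as 60,089 descriptions by 127,542 distinct tokens, with a TF-IDF weight in each cell. As a rough illustration of what those weights are, the sketch below recomputes them in plain Python. It assumes the `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three toy documents are made up.

```python
import math
import re
from collections import Counter

docs = [
    'Cloud software for healthcare data',
    'Healthcare staffing services',
    'Cloud gaming software',
]

# Default-style tokenization: lowercase, keep runs of 2+ word characters.
tokenized = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Weight term counts by IDF, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]


matrix = [tfidf_row(toks) for toks in tokenized]
# Rows are documents, columns are vocabulary terms.
print((len(matrix), len(vocab)))
```

Terms that appear in fewer documents get a larger IDF, so rare words dominate a row's weights.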


In [11]:
# Define the model
model = lgb.LGBMClassifier()
# Define the parameter grid
param_grid = {
    'num_leaves': [100, 150],
    'learning_rate': [0.1, 0.5],
    'n_estimators': [300, 400, 500],
}
# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)
X = tfidf_matrix  # The feature matrix
y = startups_df['Industry']  # The target labels
# Fit GridSearchCV on the entire dataset
grid_search.fit(X, y)
# Best cross-validation score
best_cv_score = grid_search.best_score_
print(f'Best Cross-Validation Accuracy: {best_cv_score}')

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.230941 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 339105
[LightGBM] [Info] Number of data points in the train set: 48071, number of used features: 7077
[LightGBM] [Info] Start training from score -0.514041
[LightGBM] [Info] Start training from score -1.946389
[LightGBM] [Info] Start training from score -1.350396
...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.452613 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 470075
[LightGBM] [Info] Number of data points in the train set: 60089, number of used features: 9056
[LightGBM] [Info] Start training from score -0.514032
[LightGBM] [Info] Start training from score -1.946393
[LightGBM] [Info] Start training from score -1.350417
Best Cross-Validation Accuracy: 0.8767326888228585
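
The log can be sanity-checked with a little arithmetic. The sketch below rests on two assumptions. First, the folds are split as evenly as possible, which matches the training-set sizes the log reports. Second, LightGBM's multiclass "Start training from score" values are the natural log of each class's share of the training rows. The second point is an interpretation of the log, not something the notebook states, and the log does not say which score belongs to which industry label.

```python
import math
from itertools import product

param_grid = {
    'num_leaves': [100, 150],
    'learning_rate': [0.1, 0.5],
    'n_estimators': [300, 400, 500],
}
n_rows, n_folds = 60089, 5

# Candidates and fits: every parameter combination, once per fold.
n_candidates = len(list(product(*param_grid.values())))
n_cv_fits = n_candidates * n_folds
print(n_candidates, n_cv_fits)

# Training rows per fold: the first (n_rows % n_folds) folds hold one extra
# validation row, so they leave one row fewer for training.
base, extra = divmod(n_rows, n_folds)
val_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
train_sizes = [n_rows - v for v in val_sizes]
print(train_sizes)

# Initial scores from the final fit on all rows, read as log class shares.
init_scores = [-0.514032, -1.946393, -1.350417]
class_shares = [math.exp(s) for s in init_scores]
print([round(p, 3) for p in class_shares], round(sum(class_shares), 3))
```

Under that reading the largest class covers roughly 60% of the rows, so always predicting it would score about 0.60. That is a useful baseline for the 0.877 above. The last log block, with 60089 data points, is the refit of the best candidate on the full dataset after the 60 cross-validation fits.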


In [12]:
# Best parameters
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

Best Parameters: {'learning_rate': 0.1, 'n_estimators': 500, 'num_leaves': 150}
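
All three best values sit at an edge of their grid: `n_estimators` and `num_leaves` at the top, `learning_rate` at the bottom. That often hints that the optimum lies outside the ranges searched. A small helper can flag this automatically; the name `edge_params` is hypothetical and not part of scikit-learn.

```python
def edge_params(param_grid, best_params):
    """Return {name: 'lower' | 'upper'} for best values at the edge of their grid."""
    flagged = {}
    for name, values in param_grid.items():
        ordered = sorted(values)
        if len(ordered) < 2:
            continue  # a single-value grid has no meaningful edge
        if best_params[name] == ordered[0]:
            flagged[name] = 'lower'
        elif best_params[name] == ordered[-1]:
            flagged[name] = 'upper'
    return flagged


param_grid = {
    'num_leaves': [100, 150],
    'learning_rate': [0.1, 0.5],
    'n_estimators': [300, 400, 500],
}
best_params = {'learning_rate': 0.1, 'n_estimators': 500, 'num_leaves': 150}
print(edge_params(param_grid, best_params))
```

A follow-up search could extend the grid in the flagged directions, for example smaller learning rates with more estimators. Whether that improves accuracy here is untested.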


In [13]:
# Best cross-validation score
best_cv_score = grid_search.best_score_
print(f'Best Cross-Validation Accuracy: {best_cv_score}')

Best Cross-Validation Accuracy: 0.8767326888228585
