## 🎯 Problem Statement

The goal of this notebook is to **predict whether a job posting is hourly or fixed-price** based on various job-related features.

This is a **binary classification problem**, where the target variable is:
- `is_hourly`:  
  - `True`  → Hourly Job  
  - `False` → Fixed-price Job

Using the text of each posting (its title and description), we aim to build a model that accurately classifies the job type, which can help platforms and freelancers understand job trends and make informed decisions.


## Importing Required Libraries

In [3]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
import cloudpickle

## Reading the Dataset

In [4]:
df = pd.read_csv('upwork-jobs.csv')
df

Unnamed: 0,title,link,description,published_date,is_hourly,hourly_low,hourly_high,budget,country
0,Experienced Media Buyer For Solar Pannel and R...,https://www.upwork.com/jobs/Experienced-Media-...,We’re looking for a talented and hardworking a...,2024-02-17 09:09:54+00:00,False,,,500.0,
1,Full Stack Developer,https://www.upwork.com/jobs/Full-Stack-Develop...,Job Title: Full Stack DeveloperWe are seeking ...,2024-02-17 09:09:17+00:00,False,,,1100.0,United States
2,SMMA Bubble App,https://www.upwork.com/jobs/SMMA-Bubble-App_%7...,I need someone to redesign my bubble.io site t...,2024-02-17 09:08:46+00:00,True,10.0,30.0,,United States
3,Talent Hunter Specialized in Marketing,https://www.upwork.com/jobs/Talent-Hunter-Spec...,Join Our Growing Team!We are an innovative com...,2024-02-17 09:08:08+00:00,,,,,United States
4,Data Engineer,https://www.upwork.com/jobs/Data-Engineer_%7E0...,We are looking for a resource who can work par...,2024-02-17 09:07:42+00:00,False,,,650.0,India
...,...,...,...,...,...,...,...,...,...
53053,Partial Migration From WordPress to Shopify,https://www.upwork.com/jobs/Partial-Migration-...,We're moving from Wordpress to Shopify. The Sh...,2024-02-14 06:40:39+00:00,False,,,150.0,Australia
53054,Logo work &amp; Event Booth Rendering,https://www.upwork.com/jobs/Logo-work-amp-Even...,I need some art works rendered in to booth des...,2024-02-14 06:40:26+00:00,False,,,30.0,United States
53055,Wedding Dress Collection Photographer,https://www.upwork.com/jobs/Wedding-Dress-Coll...,We are looking for a skilled photographer to c...,2024-02-14 06:40:06+00:00,True,23.0,51.0,,Australia
53056,Design a startup profile,https://www.upwork.com/jobs/Design-startup-pro...,I building a startup company and I want to des...,2024-02-14 06:40:06+00:00,False,,,70.0,Saudi Arabia


### Dropping Irrelevant or Redundant Columns


In [5]:
df.drop(['link', 'published_date',
       'hourly_low', 'hourly_high', 'budget', 'country'] , axis=1 , inplace=True)
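
In the preview above, `hourly_low` and `hourly_high` are filled only for the hourly row, and `budget` only for fixed-price rows, so keeping them would likely give away the target. A minimal sketch of that check, using a toy frame copied from the four labelled rows visible in the preview (rows 0, 1, 2 and 4), not the full dataset:

```python
import pandas as pd

# Toy frame copied from rows 0, 1, 2 and 4 of the preview (illustrative only).
sample = pd.DataFrame({
    "is_hourly":   [False, False, True, False],
    "hourly_low":  [None, None, 10.0, None],
    "hourly_high": [None, None, 30.0, None],
    "budget":      [500.0, 1100.0, None, 650.0],
})

# Readable group labels instead of raw booleans.
job_type = sample["is_hourly"].map({True: "hourly", False: "fixed"})

# For each candidate column, count how often it is populated per job type.
populated = sample[["hourly_low", "hourly_high", "budget"]].notna()
leak_table = populated.groupby(job_type).sum()
print(leak_table)
```

If the same pattern holds on the full data, each of these columns is populated for one class only, which is a good reason to drop them before modelling.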

###  Combining Title and Description into a Single Text Field

In [6]:
df['text'] = df['title'].fillna('') + ' ' + df['description'].fillna('')
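
The `fillna('')` calls matter here: concatenating a string with a missing value yields a missing value, so a posting without a description would otherwise lose its title too. A small sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame: the second row has no description.
toy = pd.DataFrame({
    "title": ["Data Engineer", "Logo work"],
    "description": ["We are looking for a resource", None],
})

# Without fillna, a missing description turns the whole combined value into NaN.
naive = toy["title"] + " " + toy["description"]

# With fillna(''), the title survives even when the description is missing.
safe = toy["title"].fillna("") + " " + toy["description"].fillna("")

print(naive.isna().tolist())
print(safe.tolist())
```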

In [7]:
df.drop(['title','description'] , axis=1 , inplace=True)

In [8]:
df

Unnamed: 0,is_hourly,text
0,False,Experienced Media Buyer For Solar Pannel and R...
1,False,Full Stack Developer Job Title: Full Stack Dev...
2,True,SMMA Bubble App I need someone to redesign my ...
3,,Talent Hunter Specialized in Marketing Join Ou...
4,False,Data Engineer We are looking for a resource wh...
...,...,...
53053,False,Partial Migration From WordPress to Shopify We...
53054,False,Logo work &amp; Event Booth Rendering I need s...
53055,True,Wedding Dress Collection Photographer We are l...
53056,False,Design a startup profile I building a startup ...


### Quick EDA

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53058 entries, 0 to 53057
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   is_hourly  44829 non-null  object
 1   text       53058 non-null  object
dtypes: object(2)
memory usage: 829.2+ KB


In [10]:
df.isnull().sum()

is_hourly    8229
text            0
dtype: int64

In [11]:
df.dropna(inplace=True , ignore_index=True)

###  Encoding Target Variable (is_hourly)

In [12]:
df['is_hourly'] = df['is_hourly'].apply(lambda x :1 if x==True else 0)

###  Splitting Features and Target

In [13]:
# Keep x as a Series of strings: clean_text_series cleans element-wise only for a Series
x, y = df['text'], df['is_hourly']

###  Text Cleaning Function

In [14]:
def clean_text_series(X):
    def clean(x):
        if not isinstance(x, str):
            return ""
        x = x.lower()
        x = x.strip()
        x = x.replace('\n', ' ')
        return x
    return X.apply(clean)
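
A quick usage sketch of the cleaner on a hypothetical Series, including non-string values. It also shows a pitfall: the function relies on `Series.apply`, so passing a one-column DataFrame hands the inner `clean` a whole column rather than a string and produces an empty result.

```python
import pandas as pd

def clean_text_series(X):
    def clean(x):
        if not isinstance(x, str):
            return ""
        x = x.lower()
        x = x.strip()
        x = x.replace('\n', ' ')
        return x
    return X.apply(clean)

# Hypothetical inputs: padded text with a newline, a missing value, a number.
raw = pd.Series(["  Full Stack\nDeveloper ", None, 42])
cleaned = clean_text_series(raw)
print(cleaned.tolist())

# Pitfall: with a DataFrame, apply() passes each column (a Series) to clean(),
# which is not a str, so the text is silently replaced by "".
as_frame = clean_text_series(raw.to_frame("text"))
print(as_frame.tolist())
```

For this reason the pipeline should be fed `df['text']` (a Series) rather than `df[['text']]` (a DataFrame).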

###  Ensuring Text Column is Clean and String Type

In [46]:
df['text'] = df['text'].fillna("").astype(str)

### Candidate Models for Model Selection

In [15]:
models = [
    LogisticRegression(max_iter=100),
    KNeighborsClassifier(),
    SVC(),
    GaussianNB(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    XGBClassifier()
]

In [16]:
scoring = {
    'accuracy':'accuracy',
    'precision':'precision',
    'recall':'recall',
    'f1':'f1'
}

### Building the Pipeline

In [17]:
pl = make_pipeline(FunctionTransformer(clean_text_series),TfidfVectorizer(stop_words='english' , max_features=5000) ,
                    SelectFromModel(DecisionTreeClassifier()),
                   LogisticRegression(max_iter=100))
pl

###  Cross-Validation for Model Evaluation

In [18]:
cv = cross_validate(estimator=pl , X=df['text'] , y = y , cv=5 , scoring=scoring , return_train_score=True)

In [19]:
print(f"ACC of train {cv['train_accuracy'].mean()}")
print(f"ACC of test {cv['test_accuracy'].mean()}")
print(f"Mean of Recall for Train is : {cv['train_recall'].mean():.4f}")
print(f"Mean of Recall for Test is : {cv['test_recall'].mean():.4f}")
print(f"Mean of Precision for Train is : {cv['train_precision'].mean():.4f}")
print(f"Mean of Precision for Test is : {cv['test_precision'].mean():.4f}")
print(f"Mean of F1 for Train is : {cv['train_f1'].mean():.4f}")
print(f"Mean of F1 for Test is : {cv['test_f1'].mean():.4f}")

ACC of train 0.992259475887708
ACC of test 0.992527176161191
Mean of Recall for Train is : 0.9869
Mean of Recall for Test is : 0.9875
Mean of Precision for Train is : 0.9979
Mean of Precision for Test is : 0.9979
Mean of F1 for Train is : 0.9924
Mean of F1 for Test is : 0.9927
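
The eight print statements can be condensed into one table. The sketch below is one possible helper (`summarize_cv` is a hypothetical name) that averages the train and test scores returned by `cross_validate`; it runs on synthetic stand-in data so it is self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

def summarize_cv(cv_results, metrics=("accuracy", "precision", "recall", "f1")):
    """Return mean train/test scores per metric as a small DataFrame."""
    rows = {
        m: {
            "train": cv_results[f"train_{m}"].mean(),
            "test": cv_results[f"test_{m}"].mean(),
        }
        for m in metrics
    }
    return pd.DataFrame(rows).T

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_cv = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5,
    scoring=["accuracy", "precision", "recall", "f1"], return_train_score=True,
)
summary = summarize_cv(demo_cv)
print(summary.round(4))
```

With the notebook's own results, `summarize_cv(cv)` would give the same numbers as the printed lines in a form that is easier to compare across models.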


###  Model Comparison with Cross-Validation

In [20]:
for model in models:
    pl = make_pipeline(FunctionTransformer(clean_text_series) , TfidfVectorizer(stop_words='english' , max_features=5000),
                       FunctionTransformer(lambda x: x.toarray() , accept_sparse=True),
                    SelectFromModel(DecisionTreeClassifier()), model)
    cv = cross_validate(estimator=pl , X=df['text'] , y = y , cv=5 , scoring=scoring , return_train_score=True , error_score='raise')
    print(model)
    print(f"ACC of train {cv['train_accuracy'].mean()}")
    print(f"ACC of test {cv['test_accuracy'].mean()}")
    print(f"Mean of Recall for Train is : {cv['train_recall'].mean():.4f}")
    print(f"Mean of Recall for Test is : {cv['test_recall'].mean():.4f}")
    print(f"Mean of Precision for Train is : {cv['train_precision'].mean():.4f}")
    print(f"Mean of Precision for Test is : {cv['test_precision'].mean():.4f}")
    print(f"Mean of F1 for Train is : {cv['train_f1'].mean():.4f}")
    print(f"Mean of F1 for Test is : {cv['test_f1'].mean():.4f}")
    print('-'*100)

LogisticRegression()
ACC of train 0.9921981317929548
ACC of test 0.9925271736730157
Mean of Recall for Train is : 0.9868
Mean of Recall for Test is : 0.9875
Mean of Precision for Train is : 0.9979
Mean of Precision for Test is : 0.9979
Mean of F1 for Train is : 0.9923
Mean of F1 for Test is : 0.9927
----------------------------------------------------------------------------------------------------
KNeighborsClassifier()
ACC of train 0.9994925161295998
ACC of test 0.9989515774968932
Mean of Recall for Train is : 0.9994
Mean of Recall for Test is : 0.9987
Mean of Precision for Train is : 0.9996
Mean of Precision for Test is : 0.9993
Mean of F1 for Train is : 0.9995
Mean of F1 for Test is : 0.9990
----------------------------------------------------------------------------------------------------
SVC()
ACC of train 0.9996709717910619
ACC of test 0.9994869432381288
Mean of Recall for Train is : 0.9994
Mean of Recall for Test is : 0.9992
Mean of Precision for Train is : 1.0000
Mean of Prec



AdaBoostClassifier()
ACC of train 0.9999609627074317
ACC of test 0.9994423327439261
Mean of Recall for Train is : 0.9999
Mean of Recall for Test is : 0.9993
Mean of Precision for Train is : 1.0000
Mean of Precision for Test is : 0.9996
Mean of F1 for Train is : 1.0000
Mean of F1 for Test is : 0.9995
----------------------------------------------------------------------------------------------------
GradientBoostingClassifier()
ACC of train 0.9998605813201102
ACC of test 0.9995538651998708
Mean of Recall for Train is : 0.9997
Mean of Recall for Test is : 0.9993
Mean of Precision for Train is : 1.0000
Mean of Precision for Test is : 0.9998
Mean of F1 for Train is : 0.9999
Mean of F1 for Test is : 0.9996
----------------------------------------------------------------------------------------------------
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_r

In [21]:
def dense_transform(X):
    return X.toarray()
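
The named `dense_transform` helper does the same job as the `lambda x: x.toarray()` used in the comparison loop. One practical difference, which may be why it is defined here, is that the standard `pickle` module cannot serialize a lambda, whereas a module-level function is pickled by reference. A small sketch (the `can_pickle` helper is hypothetical):

```python
import pickle

def dense_transform(X):
    return X.toarray()

def can_pickle(obj):
    """Return True if the standard-library pickler can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False

print("lambda:        ", can_pickle(lambda x: x.toarray()))
print("named function:", can_pickle(dense_transform))
```

The notebook saves its model with `cloudpickle`, which can serialize lambdas by value, so this matters mainly if a pipeline is later saved with `pickle` or `joblib`.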

###  Logistic Regression with Hyperparameter Tuning (GridSearchCV)

In [48]:
pl_logistic = make_pipeline(FunctionTransformer(clean_text_series) , 
                            TfidfVectorizer(stop_words='english' , max_features=5000) , 
                            FunctionTransformer(dense_transform,accept_sparse=True) ,
                            SelectFromModel(DecisionTreeClassifier()) , LogisticRegression(max_iter=100))
pl_logistic

In [50]:
pl_logistic.steps

[('functiontransformer-1',
  FunctionTransformer(func=<function clean_text_series at 0x0000024549642840>)),
 ('tfidfvectorizer', TfidfVectorizer(max_features=5000, stop_words='english')),
 ('functiontransformer-2',
  FunctionTransformer(accept_sparse=True,
                      func=<function dense_transform at 0x00000245465FE980>)),
 ('selectfrommodel', SelectFromModel(estimator=DecisionTreeClassifier())),
 ('logisticregression', LogisticRegression())]

In [52]:
lr_pram =[
    {
        'logisticregression__C':[0.1 , 0.5 , 1 , 5 , 10]
    }
]

In [54]:
lr_srch = GridSearchCV(estimator=pl_logistic , param_grid=lr_pram , scoring='accuracy' 
                       , cv = StratifiedKFold(n_splits=5) , return_train_score=True , error_score='raise')

In [59]:
x = df['text'].fillna("").astype(str)

In [61]:
lr_srch.fit(x , y)

In [63]:
lr_srch.best_estimator_

In [65]:
lr_srch.best_params_

{'logisticregression__C': 10}

In [67]:
lr_srch.best_score_

0.9952263163349079
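
`best_score_` reports only the winning setting. To see how accuracy changes across the whole grid, the `cv_results_` attribute can be loaded into a DataFrame. A self-contained sketch on synthetic data, with a scaler standing in for the text steps (`demo_srch` and the other `demo_` names are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and pipeline so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_pl = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_grid = {"logisticregression__C": [0.1, 0.5, 1, 5, 10]}
demo_srch = GridSearchCV(demo_pl, demo_grid, scoring="accuracy",
                         cv=StratifiedKFold(n_splits=5), return_train_score=True)
demo_srch.fit(X_demo, y_demo)

# One row per value of C, best-ranked first.
results = pd.DataFrame(demo_srch.cv_results_)[
    ["param_logisticregression__C", "mean_train_score",
     "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Applied to `lr_srch`, the same table would show whether `C = 10` wins by a meaningful margin or whether the scores are nearly flat across the grid.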

###  SVM with Hyperparameter Tuning (GridSearchCV)

In [70]:
pl_svm = make_pipeline(FunctionTransformer(clean_text_series) ,
                            TfidfVectorizer(stop_words='english' , max_features=5000) , 
                            FunctionTransformer(lambda x:x.toarray() ,accept_sparse=True) ,
                            SelectFromModel(DecisionTreeClassifier()) , SVC())
pl_svm

In [72]:
svm_pram = [
     {
        'svc__kernel' : ['linear'],
        'svc__C' : [0.1 , 1 , 5]
    },
    {
        'svc__kernel' : ['poly'],
        'svc__C' : [0.1 , 1 , 5],
        'svc__degree' : [2,3]
    },
    {
        'svc__kernel' : ['rbf'],
        'svc__gamma' : [0.1 , 0.5 , 1 , 5 , 10],
    }
]

In [74]:
svm_srch = GridSearchCV(estimator=pl_svm , param_grid=svm_pram , scoring='accuracy' 
                       , cv = StratifiedKFold(n_splits=5) , return_train_score=True , error_score='raise')

In [76]:
svm_srch.fit(x , y)

In [78]:
svm_srch.best_estimator_

In [80]:
svm_srch.best_params_

{'svc__C': 5, 'svc__degree': 2, 'svc__kernel': 'poly'}

In [82]:
svm_srch.best_score_

0.9993977147851977

###  Random Forest with Hyperparameter Tuning (GridSearchCV)

In [None]:
pl_rf = make_pipeline(FunctionTransformer(clean_text_series) , 
                      TfidfVectorizer(stop_words='english' , max_features=5000),
                      FunctionTransformer(lambda x:x.toarray() , accept_sparse=True) , SelectFromModel(DecisionTreeClassifier()),
                      RandomForestClassifier())
pl_rf

In [None]:
rf_params = {
    'randomforestclassifier__n_estimators': [100, 200],
    'randomforestclassifier__max_depth': [None, 10, 20],
    'randomforestclassifier__min_samples_split': [2, 5],
    'randomforestclassifier__min_samples_leaf': [1, 2],
    'randomforestclassifier__max_features': ['sqrt', 'log2']
}

In [None]:
rf_srch = GridSearchCV(estimator=pl_rf , param_grid=rf_params , scoring='accuracy' , cv = StratifiedKFold(n_splits=5)
                        ,return_train_score=True , error_score='raise')

In [None]:
rf_srch.fit(x , y)

In [None]:
rf_srch.best_estimator_

In [None]:
rf_srch.best_params_

In [None]:
rf_srch.best_score_

###  Saving the Best Logistic Regression Pipeline

In [84]:
with open("LogisticRegressionSave.pkl", "wb") as f:
    cloudpickle.dump(lr_srch.best_estimator_, f)
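
After saving, it is worth confirming that the serialized pipeline reloads and predicts the same labels. A minimal round-trip sketch, using a hypothetical toy pipeline in place of `lr_srch.best_estimator_` and an in-memory buffer in place of the `.pkl` file:

```python
import io
import cloudpickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy pipeline standing in for lr_srch.best_estimator_.
texts = pd.Series(["hourly developer needed", "fixed budget logo",
                   "hourly data entry", "fixed price migration"])
labels = [1, 0, 1, 0]
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

# Serialize to an in-memory buffer instead of a file, then load it back.
buffer = io.BytesIO()
cloudpickle.dump(toy_model, buffer)
buffer.seek(0)
restored = cloudpickle.load(buffer)

# The restored pipeline should reproduce the original predictions.
same = bool((restored.predict(texts) == toy_model.predict(texts)).all())
print("Predictions identical after reload:", same)
```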

In [None]:
#joblib.dump(lr_srch.best_estimator_, 'LogisticRegression.h5')

In [88]:
#df.to_csv('Final_Data_Classification_Deplyment.csv')

###  Streamlit App for Deployment

In [13]:
%%writefile classification_app.py
import streamlit as st
import cloudpickle
import pandas as pd

# Load the model
@st.cache_resource  # Cache the model for better performance
def load_model():
    with open("LogisticRegressionSave.pkl", "rb") as f:
        model = cloudpickle.load(f)
    return model

model = load_model()

# Set up the app UI
st.title("Text Classification App")
st.write("This app uses a logistic regression model to predict whether a job posting is hourly or fixed-price")

# Text input
user_input = st.text_area("Enter the text you want to classify:", "")

if st.button("Predict"):
    if user_input:
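        # Wrap the input in a Series: the pipeline was fitted on a Series of strings,
        # and clean_text_series cleans element-wise only for a Series.
        # A DataFrame would be cleaned to an empty string.
        input_series = pd.Series([user_input])

        # Get prediction (1 = hourly, 0 = fixed-price)
        prediction = model.predict(input_series)

        # Display result
        st.success(f"Prediction: {prediction[0]}")

        # (Optional) Get class probabilities
        try:
            probabilities = model.predict_proba(input_series)
            st.subheader("Class Probabilities:")
            for i, prob in enumerate(probabilities[0]):
                st.write(f"Class {i}: {prob:.2%}")
        except AttributeError:
            # The loaded model does not expose predict_proba
            pass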
    else:
        st.warning("Please enter some text to classify")

# Add sidebar with info
st.sidebar.header("About")
st.sidebar.info(
    """
    This application uses a machine learning pipeline that includes:
    - Text cleaning
    - TF-IDF vectorization (5000 features)
    - Feature selection (Decision Tree)
    - Logistic Regression classifier
    """
)

Overwriting classification_app.py
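
The app displays the raw model output (`0` or `1`). Since the target was encoded as `1` for hourly and `0` for fixed-price, a small lookup could make the result more readable. The names `LABELS` and `describe_prediction` below are hypothetical and not part of the app as written:

```python
# Label names taken from the problem statement; encoding from the target-encoding cell.
LABELS = {0: "Fixed-price Job", 1: "Hourly Job"}

def describe_prediction(pred):
    """Map the model's 0/1 output to a human-readable job type."""
    return LABELS.get(int(pred), f"Unknown class {pred}")

print(describe_prediction(1))
print(describe_prediction(0))
```

In the app, this could be used as `st.success(f"Prediction: {describe_prediction(prediction[0])}")`.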


In [15]:
! streamlit run classification_app.py

^C
