<a href="https://colab.research.google.com/github/egenc/DataScience_tasks/blob/main/Task_customData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd

PATH = "/content/drive/MyDrive/dataset/costPre.csv"

df = pd.read_csv(PATH)
df.head()

Unnamed: 0,Action,22 Jan,23 Jan
0,I am buying less food and essentials,21,31
1,"I am using less water, energy or fuel",22,45
2,I am buying cheaper products,36,49
3,I am shopping around more or switching providers,22,27
4,I am spending less on non-essentials,45,53


In [2]:
df.columns = ["Action", "22Jan", "23Jan"]
df.head()

Unnamed: 0,Action,22Jan,23Jan
0,I am buying less food and essentials,21,31
1,"I am using less water, energy or fuel",22,45
2,I am buying cheaper products,36,49
3,I am shopping around more or switching providers,22,27
4,I am spending less on non-essentials,45,53


Dropping duplicate actions to reduce noise. Note that `keep=False` removes every copy of a duplicated action, not only the repeats.

In [3]:
print(len(df))
df = df.drop_duplicates(subset="Action",
                     keep=False)
len(df)

25


25
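
To make the effect of `keep` concrete, here is a minimal sketch on a hypothetical toy frame (not the notebook's dataset). It compares `keep=False`, as used above, with `keep="first"`, which may be closer to what "dropping duplicates" usually means.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's dataset.
toy = pd.DataFrame({
    "Action": ["buying less food", "buying less food", "using my savings"],
    "22Jan": [21, 25, 20],
})

# keep=False removes every row whose Action is duplicated.
dropped_all = toy.drop_duplicates(subset="Action", keep=False)

# keep="first" retains the first occurrence of each Action.
kept_first = toy.drop_duplicates(subset="Action", keep="first")

print(len(dropped_all), len(kept_first))
```

Since the real data has no duplicates (25 rows before and after), both settings give the same result here.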

Checking for empty strings. Empty or whitespace-only actions carry no text for the model to learn from, so they add noise and may reduce performance.



In [4]:
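def check_empty_strings(s) -> bool:
  # True when the string still has characters after removing spaces
  return len(s.replace(" ", "")) > 0

# Flag rows whose Action has real content, then list the rows that do not
df["string_content"] = df["Action"].apply(check_empty_strings)
df[~df.string_content]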

Unnamed: 0,Action,22Jan,23Jan,string_content
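
The same check can probably be written without a helper function by using pandas string methods. This is a sketch on a hypothetical toy Series; unlike the cell above, `str.strip()` also treats tabs and newlines as empty.

```python
import pandas as pd

# Hypothetical toy column with one whitespace-only entry.
actions = pd.Series(["buying less food", "   ", "using my savings"])

# True where the string still has characters after stripping whitespace.
has_content = actions.str.strip().str.len() > 0

print(has_content.tolist())
```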


In [5]:
# Drop any rows flagged as empty above
df = df.drop(df[df.string_content == False].index)

df.head(-10)

Unnamed: 0,Action,22Jan,23Jan,string_content
0,I am buying less food and essentials,21,31,True
1,"I am using less water, energy or fuel",22,45,True
2,I am buying cheaper products,36,49,True
3,I am shopping around more or switching providers,22,27,True
4,I am spending less on non-essentials,45,53,True
5,I am using free transport (walking or cycling),26,21,True
6,I am doing free activities,19,13,True
7,"I am going without essentials (food, electrici...",6,9,True
8,I am stopping or delaying spend on non-essentials,32,24,True
9,I am using my savings,20,20,True


Let's get the length of the longest string

In [6]:
int(df["Action"].str.len().max())

114

Removing "I am " for clarity

In [7]:
# regex=False makes it explicit that "I am " is matched literally
df["Action"] = df["Action"].str.replace("I am ", "", regex=False)
df.head()

Unnamed: 0,Action,22Jan,23Jan,string_content
0,buying less food and essentials,21,31,True
1,"using less water, energy or fuel",22,45,True
2,buying cheaper products,36,49,True
3,shopping around more or switching providers,22,27,True
4,spending less on non-essentials,45,53,True


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 24
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Action          25 non-null     object
 1   22Jan           25 non-null     object
 2   23Jan           25 non-null     int64 
 3   string_content  25 non-null     bool  
dtypes: bool(1), int64(1), object(2)
memory usage: 1.3+ KB


In [9]:
df.drop("string_content", axis=1, inplace=True)
df.head()

Unnamed: 0,Action,22Jan,23Jan
0,buying less food and essentials,21,31
1,"using less water, energy or fuel",22,45
2,buying cheaper products,36,49
3,shopping around more or switching providers,22,27
4,spending less on non-essentials,45,53


In [10]:
# Replacing unknown string values (?, -) with 0 values.
# Assigning the result back avoids calling replace(inplace=True) on a column selection.
df = df.replace(["?", "-"], 0)
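
`df.info()` above reports `22Jan` as `object`, which suggests the column holds some non-numeric entries. One possible way to get a clean numeric column is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a hypothetical toy Series; the placeholder values are assumptions, not taken from the real file.

```python
import pandas as pd

# Hypothetical toy column that mixes numeric strings with placeholders.
raw = pd.Series(["21", "?", "36", "-"])

# errors="coerce" turns anything unparseable into NaN; fillna(0) then mirrors
# the notebook's choice of replacing unknown values with 0.
clean = pd.to_numeric(raw, errors="coerce").fillna(0).astype(int)

print(clean.tolist())
```

Whether 0 is a sensible stand-in for an unknown value is a modelling choice; dropping those rows would be another option.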

In [14]:
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [24]:
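import re

# re.escape keeps every punctuation character literal inside the class;
# regex=True is explicit because the default differs between pandas versions
df['Action'] = df['Action'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)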
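
The training text is cleaned with a regex here, while `preprocess_input` further down uses `str.translate`. The sketch below, on hypothetical toy strings, checks that the two approaches remove the same characters when the punctuation is escaped with `re.escape`.

```python
import re
import string

import pandas as pd

# Hypothetical toy column.
actions = pd.Series(["using less water, energy or fuel", "non-essentials (misc.)"])

# re.escape makes every punctuation character literal inside the class.
pattern = "[{}]".format(re.escape(string.punctuation))
via_regex = actions.str.replace(pattern, "", regex=True)

# Equivalent without regex: delete characters through a translation table.
table = str.maketrans("", "", string.punctuation)
via_translate = actions.str.translate(table)

print(via_regex.tolist())
```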


In [25]:
df.head(-5)

Unnamed: 0,Action,22Jan,23Jan
0,buying less food and essentials,21,31
1,using less water energy or fuel,22,45
2,buying cheaper products,36,49
3,shopping around more or switching providers,22,27
4,spending less on nonessentials,45,53
5,using free transport walking or cycling,26,21
6,doing free activities,19,13
7,going without essentials food electricity or g...,6,9
8,stopping or delaying spend on nonessentials,32,24
9,using my savings,20,20


There are no null values.

Let's predict values for 22Jan first.

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["Action"])
y = df["22Jan"]

In [27]:
from sklearn import linear_model

clf = linear_model.BayesianRidge()

clf.fit(X.toarray(), y)

BayesianRidge()
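
`BayesianRidge` can also report how uncertain it is about a prediction, which may be useful with only 25 training rows. A minimal sketch on hypothetical toy data, using `predict(..., return_std=True)`:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Hypothetical toy regression data: y is roughly 2 * x.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0.1, 2.0, 3.9, 6.1, 8.0])

model = BayesianRidge().fit(X_toy, y_toy)

# return_std=True also returns the standard deviation of the predictive distribution.
mean, std = model.predict(np.array([[5.0]]), return_std=True)
print(mean, std)
```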

In [28]:
def preprocess_input(texts) -> list:
  # Lowercase the inputs and strip punctuation
  texts = [s.lower() for s in texts]
  texts = [s.translate(str.maketrans('', '', string.punctuation)) for s in texts]
  for s in texts:
    if len(s) > 100: # Length guard; the longest raw training string above was 114 chars
      raise ValueError("One or more strings are longer than 100 chars")
  return texts

def predictor(classifier, name_l):
  # Clean the inputs, vectorize them with the fitted global `vectorizer`, then predict
  name_l = preprocess_input(name_l)

  tmp = vectorizer.transform(name_l)
  preds = classifier.predict(tmp)

  return preds  # NumPy array with one prediction per input string
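
`predictor` relies on the global `vectorizer`. One alternative, sketched here under assumptions, is to bundle the vectorizer and the model in a scikit-learn `Pipeline` so they always travel together. The `to_dense` helper and the toy data are hypothetical; the dense conversion mirrors the `X.toarray()` call used when fitting above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(matrix):
    """Convert the sparse count matrix to a dense array (hypothetical helper)."""
    return matrix.toarray()


# Hypothetical toy training data.
texts = ["buying less food", "using my savings", "doing free activities"]
values = [21, 20, 19]

pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(to_dense, accept_sparse=True),
    BayesianRidge(),
)
pipeline.fit(texts, values)

# The pipeline takes raw strings and returns one prediction per string.
preds = pipeline.predict(["buying less food"])
print(preds)
```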

In [29]:
predictor(clf, ["buying less food, less drinks"])

array([34.97362181])

As shown, the model predicts a value of about **34.97%** for this input on 22Jan.

Let's train another model for **23Jan**

In [30]:
y_23 = df["23Jan"]

clf_23 = linear_model.BayesianRidge()

clf_23.fit(X.toarray(), y_23)

predictor(clf_23, ["buying less food, less drinks"])

array([51.6133503])

At first glance this looks plausible, because comparable actions in the training data have similar values.
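
A single prediction is only a rough sanity check. With so few rows, leave-one-out cross-validation is one way to get a more honest error estimate. The sketch below runs on hypothetical random data standing in for the count matrix and the target, so the printed error says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Hypothetical stand-in for the dense count matrix and the target column.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(25, 8)).astype(float)
y_toy = X_toy @ np.arange(1, 9) + rng.normal(0, 0.5, size=25)

# Each row is predicted by a model trained on the other 24 rows.
held_out = cross_val_predict(BayesianRidge(), X_toy, y_toy, cv=LeaveOneOut())
print(mean_absolute_error(y_toy, held_out))
```

On the real data, `X.toarray()` and `y` would take the place of `X_toy` and `y_toy`.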

In [31]:
text = "buying less food, descreasing market costs"
print("22Jan:", predictor(clf, [text]))
print("23Jan:", predictor(clf_23, [text]))

22Jan: [25.77541641]
23Jan: [36.19332146]


Again, the two models behave consistently: the 23Jan prediction is higher than the 22Jan one, as in the previous example.

<< Please change the text above and try out your own input. >>