# Fuzzy Matching

Last update 29.08.2023 Anna K

In [1]:
import pandas as pd

## Uploading the cleaned dataset

### Try 1

In [2]:
# Load your existing database into a DataFrame
data = pd.read_csv('from_yannik/data_clean_with_stopwords.csv') # insert path
# Notice the .copy() to copy the values 
data = data.copy()

In [3]:
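# reset_index() returns a new DataFrame; the result is not assigned here, so the original index is kept
data.reset_index()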
data = data[["sender", "text", "date"]]

### Try 2

In [4]:
# Load your existing database into a DataFrame
data2 = pd.read_csv('from_yannik/data_handover_for_anna.csv') # insert path
# Notice the .copy() to copy the values 
data2 = data2.copy()

### Try 3

In [5]:
# Load your existing database into a DataFrame
data3 = pd.read_csv('from_yannik/data_handover_for_team.csv') # insert path
# Notice the .copy() to copy the values 
data3 = data3.copy()

## Preparation: Fuzzy Matching & Stations

In [6]:
!pip install thefuzz



In [7]:
from thefuzz import process
from thefuzz import fuzz

In [32]:
# Upload dataframe with station names
station_df = pd.read_csv('s_u_stations_fixed_with_keys.csv')
# Creating a list with station names
stations = list(station_df['keys'].values)

## Filters to identify station

In [9]:
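def identify_station(matches):
    """Pick station(s) from a process.extract result obtained with limit=2.

    matches is [(name, score), (name, score)], best match first.
    """
    # Both candidates score above 70: return them as a pair
    if matches[1][1] > 70:
        return matches[0][0], matches[1][0]
    # Only the best candidate scores above 70
    elif matches[0][1] > 70:
        return matches[0][0]
    return None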

In [10]:
def identify_station_precise(matches):
    """Stricter variant of identify_station: higher score thresholds.

    matches is [(name, score), (name, score)], best match first.
    """
    # Second candidate must score above 90 to return a pair
    if matches[1][1] > 90:
        return matches[0][0], matches[1][0]
    # Try 79, 89, or other thresholds: fewer matched rows, but higher quality
    elif matches[0][1] > 79:
        return matches[0][0]
    return None
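
To make the difference between the two threshold schemes concrete, the sketch below restates both filters compactly (same thresholds as above) and applies them to one `process.extract`-style result that appears in the Try 1 output. The `sample` variable is only illustrative.

```python
def identify_station(matches):
    # matches: [(name, score), (name, score)], best match first
    if matches[1][1] > 70:
        return matches[0][0], matches[1][0]
    elif matches[0][1] > 70:
        return matches[0][0]
    return None


def identify_station_precise(matches):
    if matches[1][1] > 90:
        return matches[0][0], matches[1][0]
    elif matches[0][1] > 79:
        return matches[0][0]
    return None


# One result row from the notebook output (scores 95 and 86)
sample = [("platz der luftbrücke", 95), ("nauener platz", 86)]

loose = identify_station(sample)           # second score 86 > 70, so both names are kept
strict = identify_station_precise(sample)  # second score 86 <= 90, so only the best name is kept
print(loose)
print(strict)
```

The looser filter keeps the spurious second candidate, while the stricter one returns a single station for the same message.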

# Fuzz Try 1

In [11]:
station_entries = []
for entries in data["text"]:
    out = process.extract(entries, stations, limit=2)
    station_entries.append(out)

Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '?!']
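
These warnings are raised for messages that contain no letters or digits, so the preprocessing step leaves an empty query. One way to avoid them is to skip such messages before matching. A minimal sketch, assuming a hypothetical helper `has_alnum` (not part of thefuzz or of this notebook):

```python
def has_alnum(text):
    """Return True if the text contains at least one letter or digit."""
    return any(ch.isalnum() for ch in text)


# Illustrative messages; the punctuation-only ones mirror the queries in the warnings
messages = ["?!", "u6 platz der luftbrücke", "^^", "+ +", "kontis moritzplatz"]
matchable = [m for m in messages if has_alnum(m)]
print(matchable)
```

In the notebook this could be applied as `data = data[data["text"].map(has_alnum)]` before the matching loop, assuming the `text` column has no missing values at that point.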

In [12]:
data['fuzz'] = station_entries

In [13]:
data['station'] = station_entries
data['station'] = data['station'].map(identify_station)

In [14]:
data.tail(20)

Unnamed: 0,sender,text,date,fuzz,station
147452,999215014.0,s41 tempelhof und s41 ostkreuz \nfett kontros ...,2020-11-07 14:56:05,"[(gesundbrunnen, 60), (ostkreuz, 60)]",
147453,999365610.0,immer noch…,2023-03-24 12:45:03,"[(buch, 60), (messe nord/icc, 50)]",
147454,999365610.0,"bvg fährt alles, nur s-bahn & co werden bestreikt",2023-03-27 09:27:54,"[(leinestr, 68), (gehrenseestr, 60)]",
147455,999365610.0,"sind drin, hinteres ende des zuges. grade mehr...",2023-04-11 12:18:39,"[(mehringdamm, 90), (eichborndamm, 65)]",mehringdamm
147456,999365610.0,"nicht wirklich frei fahren relevant, aber viel...",2023-04-13 14:53:35,"[(mendelssohn bartholdy park, 86), (neukölln, ...",mendelssohn bartholdy park
147457,999365610.0,"2x ordnungsamt, vorderer teil des zuges am bun...",2023-05-17 09:43:12,"[(bundesplatz, 90), (hansaplatz, 63)]",bundesplatz
147458,999416326.0,u6 platz der luftbrücke,2023-07-05 17:13:00,"[(platz der luftbrücke, 95), (nauener platz, 86)]","(platz der luftbrücke, nauener platz)"
147459,999425588.0,hallo hätte jemand ein berlin ticket u vom ges...,2020-08-17 08:03:02,"[(karl bonhoeffer nervenklinik, 86), (karl mar...","(karl bonhoeffer nervenklinik, karl marx str)"
147460,999425588.0,yes got one yesterday at 13:30 would be lovely...,2020-08-17 08:20:29,"[(neu westend, 65), (konstanzer str, 51)]",
147461,999425588.0,hallo hätte jemand ein berlin ticket u vom ges...,2020-08-17 10:48:37,"[(karl bonhoeffer nervenklinik, 86), (karl mar...",karl bonhoeffer nervenklinik


# Fuzz Try 2 (partial ratio, dataset not fully clean)

In [15]:
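# reset_index() returns a new DataFrame; the result is not assigned here, so the original index is kept
data2.reset_index()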
data2 = data2[["sender", "text", "date"]]

In [16]:
data2.dropna(subset='text', inplace=True)

In [17]:
data2["text"] = data2["text"].str.strip()

In [18]:
data2["text"].duplicated().sum()

8358

In [19]:
data2 = data2.drop_duplicates(subset='text')

In [20]:
data2 = data2[data2["text"] != ""]

In [21]:
data2.dropna(subset='text', inplace=True)

In [22]:
station_entries2 = []
for entries in data2["text"]:
    out = process.extract(entries, stations, limit=2, scorer=fuzz.partial_ratio)
    station_entries2.append(out)

Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '̈']
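
`fuzz.partial_ratio` scores the best-matching substring rather than the whole message, which is why station names embedded in long texts reach 100 in the output of this try. The following is a simplified stand-in built on the standard library's `difflib`, not thefuzz's actual implementation; `simple_partial_ratio` is a hypothetical name and its scores will not match thefuzz exactly.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(needle, haystack):
    """Best similarity (0-100) between needle and any equally long window of haystack."""
    # Always slide the shorter string over the longer one
    if len(needle) > len(haystack):
        needle, haystack = haystack, needle
    if not needle:
        return 0
    best = 0.0
    for start in range(len(haystack) - len(needle) + 1):
        window = haystack[start:start + len(needle)]
        best = max(best, SequenceMatcher(None, needle, window).ratio())
    return round(100 * best)


text = "drin hinteres ende zuges grade mehringdamm"
print(simple_partial_ratio("mehringdamm", text))   # exact substring of the message
print(simple_partial_ratio("eichborndamm", text))  # only partially similar
```

A side effect of substring scoring is that short keys are easy to approximate inside unrelated text, which is probably why `tegel` reaches 80 for messages that do not mention it.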


In [23]:
data2['station'] = station_entries2
data2['station'] = data2['station'].map(identify_station)

In [24]:
data2['fuzz'] = station_entries2

In [26]:
data2.loc[147452, 'text']

's41 tempelhof s41 ostkreuz fett kontros bvg sicherheit eingestiegen  db sicherheit beides sicherheit kontros  deren umhängekarte gesehn rausholen gesundbrunnen 4 teams unterwegs team is storkower raus'

In [25]:
data2.tail(20)

Unnamed: 0,sender,text,date,station,fuzz
147449,999215014.0,hackeschee markt 3 menschen friedrichstraße s1...,2020-08-06 11:22:34,"(friedrichstr, hackescher markt)","[(friedrichstr, 100), (hackescher markt, 94)]"
147450,999215014.0,2 kontros u 6 kochstraße ausgestiegen 2 kontro...,2020-08-12 06:11:57,"(kochstr, tegel)","[(kochstr, 100), (tegel, 80)]"
147451,999215014.0,polizei rathaus steglitz 1 kontrolleur,2020-08-20 15:44:48,"(rathaus steglitz, tegel)","[(rathaus steglitz, 100), (tegel, 80)]"
147452,999215014.0,s41 tempelhof s41 ostkreuz fett kontros bvg si...,2020-11-07 14:56:05,"(gesundbrunnen, ostkreuz)","[(gesundbrunnen, 100), (ostkreuz, 100)]"
147453,999365610.0,immer noch…,2023-03-24 12:45:03,,"[(buch, 67), (kochstr, 60)]"
147454,999365610.0,bvg fährt sbahn co bestreikt,2023-03-27 09:27:54,,"[(oberspree, 67), (seestr, 67)]"
147455,999365610.0,drin hinteres ende zuges grade mehringdamm,2023-04-11 12:18:39,"(mehringdamm, eichborndamm)","[(mehringdamm, 100), (eichborndamm, 73)]"
147456,999365610.0,wirklich frei fahren relevant vielleicht trot...,2023-04-13 14:53:35,"(neukölln, tierpark)","[(neukölln, 100), (tierpark, 75)]"
147457,999365610.0,2x ordnungsamt vorderer teil zuges bundesplat...,2023-05-17 09:43:12,"(bundesplatz, bundestag)","[(bundesplatz, 100), (bundestag, 78)]"
147459,999425588.0,hallo hätte jemand berlin ticket u gestern 130...,2020-08-17 08:03:02,"(neukölln, karl marx str)","[(neukölln, 100), (karl marx str, 85)]"


### More precise mapping

In [27]:
data2['station'] = station_entries2
data2['station'] = data2['station'].map(identify_station_precise)

In [28]:
data2.tail(30)

Unnamed: 0,sender,text,date,station,fuzz
147439,999118258.0,u5 blauwesten hauptbanhof,2023-02-17 11:54:30,hauptbahnhof,"[(hauptbahnhof, 96), (westend, 86)]"
147440,999204625.0,s1 richtung wannsee 1 mann recht kräftig geb...,2020-07-15 16:42:15,"(potsdamer platz, wannsee)","[(potsdamer platz, 100), (wannsee, 100)]"
147441,999204625.0,s1 richtung oranienburg höhe friedrichstraße ...,2020-08-25 06:10:07,"(friedrichstr, oranienburg)","[(friedrichstr, 100), (oranienburg, 100)]"
147442,999204625.0,s1 nach wannsee höhe julius leber 2 frauen zug,2020-10-22 05:57:18,wannsee,"[(wannsee, 100), (julius leber brücke, 74)]"
147443,999204625.0,s1 friedrichstraße unteren bahnhof 3 männer ...,2020-12-22 10:21:47,friedrichstr,"[(friedrichstr, 100), (friedrichshagen, 80)]"
147444,999204625.0,2 ppl friedrichstraße 2nd one all black clothing,2021-02-25 09:24:20,friedrichstr,"[(friedrichstr, 100), (friedrichshagen, 80)]"
147445,999204625.0,2 männer frau kaiser wilhelm platz gerade beid...,2021-05-12 11:33:47,tegel,"[(tegel, 80), (tiergarten, 75)]"
147446,999204625.0,friedrichstraße direction west s5 direction ch...,2021-11-27 22:31:52,"(charlottenburg, friedrichstr)","[(charlottenburg, 100), (friedrichstr, 100)]"
147447,999204625.0,s1 at friedrichstraße direction north white bag,2021-12-08 07:40:29,friedrichstr,"[(friedrichstr, 100), (friedrichshagen, 80)]"
147448,999215014.0,s7 linie,2020-08-06 10:52:07,,"[(unter den linden, 67), (altglienicke, 62)]"


In [29]:
data2.dropna(subset="station", inplace = True)

In [30]:
data2

Unnamed: 0,sender,text,date,station,fuzz
2,-1.001571e+12,kontis mehringdamm weiß nich schon sorry gle...,2022-03-03 16:41:36,"(mehringdamm, spandau)","[(mehringdamm, 100), (spandau, 100)]"
3,-1.001571e+12,kontis moritzplatz,2022-03-10 14:38:26,moritzplatz,"[(moritzplatz, 100), (hansaplatz, 67)]"
5,-1.001571e+12,2 kontis u7 nach spandau bei konstanter str au...,2022-06-12 12:33:58,"(spandau, konstanzer str)","[(spandau, 100), (konstanzer str, 93)]"
6,-1.001571e+12,bitte teilen please spread alerta antifascist...,2022-07-14 12:36:03,kottbusser tor,"[(kottbusser tor, 100), (bundestag, 78)]"
7,-1.001615e+12,großkontrolle u8 schönleibstraße neongelben sc...,2021-12-10 22:49:44,schönleinstr,"[(schönleinstr, 92), (leinestr, 75)]"
...,...,...,...,...,...
147466,9.995103e+08,2 guys 2 girls u5 weberwiese,2020-06-26 08:22:52,weberwiese,"[(weberwiese, 100), (westend, 67)]"
147467,9.996845e+08,u2 bei rosalixenburger platz 3 manner,2021-03-08 13:48:04,rosa luxemburg platz,"[(rosa luxemburg platz, 80), (nauener platz, 77)]"
147468,9.996845e+08,u5 weberwiese 34 manner,2021-03-08 14:52:48,weberwiese,"[(weberwiese, 100), (erkner, 67)]"
147469,9.996845e+08,u2 rosa luxemburg platz richtung alex,2021-04-20 11:53:17,"(alex, rosa luxemburg platz)","[(alex, 100), (rosa luxemburg platz, 100)]"


In [31]:
data2.to_csv("preliminary_output_witt_fuzz.csv")

# Fuzz Try 3

In [33]:
stations = list(station_df['keys'].values)
station_entries3 = []
for entries in data3["text"]:
    out = process.extract(entries, stations, limit=2, scorer=fuzz.partial_ratio)
    station_entries3.append(out)

Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '̈']


In [47]:
df_chat = data3[["date"]].copy()  # explicit copy, so the assignments below do not raise SettingWithCopyWarning

In [48]:
df_chat["station_key"] = station_entries3
df_chat["text"] = data3["text"]
df_chat["station_key"] = df_chat["station_key"].map(identify_station_precise)
df_chat.dropna(subset="station_key", inplace=True)

In [49]:
df_chat.reset_index()

Unnamed: 0,index,date,station_key,text
0,6,2018-02-15 12:03:54,wildau,https shopdigitalcouragedelichtbildausweismit...
1,9,2018-02-15 13:50:45,tegel,leute bitte hört nachrichten gruppe schreiben ...
2,18,2018-02-15 16:12:24,siemensdamm,siehe
3,24,2018-02-15 16:47:39,"(oranienburg, oranienburger str)",s25 oranienburgerstr richtung süden
4,31,2018-02-15 18:44:50,erkner,kritik durchaus berechtigt sehe aba tatsache g...
...,...,...,...,...
116636,138737,2023-08-18 11:48:39,eberswalder str,u2 eberswalder str bvg sicherheit polizei
116637,138738,2023-08-18 12:10:31,friedrichsfelde,u5 friedrichsfelde 2x blau
116638,138739,2023-08-18 12:36:47,mehringdamm,u6 mehringdamm station
116639,138741,2023-08-18 12:46:51,"(rathaus spandau, spandau)",3 boss s rathaus spandau


In [None]:
# Outline for Data analysis

# probability for monday
# probability for day of the week
# probability for time of the day
# probability for seasons
# most frequently occurring stations > use these to check the fuzzy match

# aggregate the time object > decompose and do aggregations
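
The weekday and time-of-day items in this outline can be sketched with the standard library alone. The timestamps are four `date` values from the output above; `share_by` is a hypothetical helper, and a relative frequency is used here as a simple stand-in for "probability".

```python
from collections import Counter
from datetime import datetime


def share_by(timestamps, key):
    """Relative frequency of key(dt) over timestamps in 'YYYY-MM-DD HH:MM:SS' format."""
    parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in timestamps]
    counts = Counter(key(dt) for dt in parsed)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


dates = [
    "2023-08-18 11:48:39",
    "2023-08-18 12:10:31",
    "2023-08-18 12:36:47",
    "2023-08-18 12:46:51",
]

# weekday() is locale-independent: Monday is 0, Sunday is 6
by_weekday = share_by(dates, lambda dt: dt.weekday())
by_hour = share_by(dates, lambda dt: dt.hour)
print(by_weekday)
print(by_hour)
```

The same decomposition extends to seasons by passing a key based on `dt.month`.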

