In [2]:
import pandas as pd

To get access to the Stack Overflow data, we have to aggregate the locations to NUTS3 regions in order to preserve the privacy of Stack Overflow users.

**What are NUTS regions?**

The NUTS classification (Nomenclature of Territorial Units for Statistics) is a hierarchical system for dividing up the economic territory of the EU and the UK for the purpose of:

- the collection, development and harmonisation of European regional statistics
- socio-economic analyses of the regions

The classification has three levels:

1. NUTS 1: major socio-economic regions
2. NUTS 2: basic regions for the application of regional policies
3. NUTS 3: small regions for specific diagnoses
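
Because the levels are nested, the code of a region starts with the code of the region that contains it. The sketch below reads the parent levels from a NUTS 3 code by truncation. It assumes a two-letter country prefix followed by one character per level, and the helper name `nuts_levels` is hypothetical:

```python
def nuts_levels(nuts3_code: str) -> dict:
    """Split a NUTS 3 code into its country, NUTS 1, NUTS 2 and NUTS 3 parts."""
    code = nuts3_code.strip().upper()
    # Assumption: 2-letter country prefix + one character per level, e.g. "PL619"
    if len(code) != 5:
        raise ValueError(f"not a NUTS 3 code: {nuts3_code!r}")
    return {
        "country": code[:2],
        "NUTS1": code[:3],
        "NUTS2": code[:4],
        "NUTS3": code,
    }


print(nuts_levels("PL619"))
```

This is only meant to illustrate the hierarchy; the mapping table loaded further down already lists the NUTS 1 and NUTS 2 codes explicitly.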

In [3]:
# Path to the CSV file with the poster locations
locations = 'Downloads/deeslocations--morecountries.csv'

# Read the CSV file into a pandas DataFrame
location_df = pd.read_csv(locations)

# Keep only the country of interest; we start by mapping poster locations in Poland to NUTS3 regions
location_pl = location_df[location_df["Country"] == "PL"].reset_index(drop=True)
location_pl

Unnamed: 0,Town,Region,Country
0,Żychlin,Łódź Voivodeship,PL
1,Łódź,Łódź Voivodeship,PL
2,Łanięta,Łódź Voivodeship,PL
3,Łęczyca,Łódź Voivodeship,PL
4,Adamow,Łódź Voivodeship,PL
...,...,...,...
3248,Wolczkowo,West Pomerania,PL
3249,Wolin,West Pomerania,PL
3250,Zieleniewo,West Pomerania,PL
3251,Zielin,West Pomerania,PL


In [6]:
location_pl[location_pl["Town"] == "Adamow"] 

Unnamed: 0,Town,Region,Country
4,Adamow,Łódź Voivodeship,PL
1726,Adamow,Mazovia,PL
2493,Adamow,Silesia,PL


The same town name appears in more than one region, so we need to match towns on both the name and the Region to assign the correct NUTS3 code.
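
To see how widespread this is, one could count the distinct regions per town name. A minimal sketch on toy rows copied from the output above (the names `towns`, `regions_per_town` and `ambiguous` are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy rows taken from the location_pl output above
towns = pd.DataFrame({
    "Town": ["Adamow", "Adamow", "Adamow", "Wolin"],
    "Region": ["Łódź Voivodeship", "Mazovia", "Silesia", "West Pomerania"],
})

# Number of distinct regions in which each town name occurs
regions_per_town = towns.groupby("Town")["Region"].nunique()

# Town names that occur in more than one region
ambiguous = regions_per_town[regions_per_town > 1]
print(ambiguous)
```

Running the same two lines on `location_pl` would list every town name that cannot be resolved by name alone.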

In [137]:
# Import the dataset with the NUTS region mapping for Poland
mapping = 'Downloads/NUTS mapping.xlsx'

# Read the Excel file into a pandas DataFrame
mapping_df = pd.read_excel(mapping)
mapping_df

Unnamed: 0,NUTS1,NUTS1 CODE,NUTS2,NUTS2 CODE,NUTS3,NUTS3 CODE,County
0,MAKROREGION PolNOCNY,PL6,KUJAWSKO-POMERANIA,PL61,WlOClAWSKI,PL619,aleksandrowski
1,MAKROREGION WSCHODNI,PL8,PODLASIE,PL84,SUWALSKI,PL843,augustowski
2,MAKROREGION PolNOCNY,PL6,WARMIA-MASURIA,PL62,OLSZTYnSKI,PL622,bartoszycki
3,MAKROREGION POlUDNIOWY,PL2,SILESIA,PL22,SOSNOWIECKI,PL22B,będziński
4,MAKROREGION CENTRALNY,PL7,ŁÓDŹ VOIVODESHIP,PL71,PIOTRKOWSKI,PL713,bełchatowski
...,...,...,...,...,...,...,...
368,MAKROREGION POlUDNIOWY,PL2,SILESIA,PL22,RYBNICKI,PL227,Żory
369,MAKROREGION WOJEWoDZTWO MAZOWIECKIE\t,PL9,MAZOVIA,PL92,CIECHANOWSKI,PL922,żuromiński
370,MAKROREGION WOJEWoDZTWO MAZOWIECKIE\t,PL9,MAZOVIA,PL92,RADOMSKI,PL921,zwoleński
371,MAKROREGION WOJEWoDZTWO MAZOWIECKIE\t,PL9,MAZOVIA,PL92,ZYRARDOWSKI,PL926,żyrardowski


In [138]:
# rename() needs columns=...; a bare dict would be applied to the row index instead
mapping_df = mapping_df.rename(columns={"Region": "County"})

The table contains the NUTS region mapping for Poland.

The NUTS1, NUTS2 and NUTS3 columns hold the region names, and the NUTS1 CODE, NUTS2 CODE and NUTS3 CODE columns hold the corresponding codes.
"County" is the smaller administrative unit **within** a NUTS3 region. Each "County" is attached to a specific NUTS3 region.

The location_pl dataframe only gives us the town and the major Region. We need to map the "County" from mapping_df to the Town + Region pairs of location_pl.
To do that, we need a list of all towns in Poland together with the "County" in which each of them is located.

In [139]:
# Let's drop unnecessary columns
columns_to_drop = ['NUTS1', 'NUTS2', 'NUTS3']
mapping_df = mapping_df.drop(columns=columns_to_drop)
mapping_df

Unnamed: 0,NUTS1 CODE,NUTS2 CODE,NUTS3 CODE,County
0,PL6,PL61,PL619,aleksandrowski
1,PL8,PL84,PL843,augustowski
2,PL6,PL62,PL622,bartoszycki
3,PL2,PL22,PL22B,będziński
4,PL7,PL71,PL713,bełchatowski
...,...,...,...,...
368,PL2,PL22,PL227,Żory
369,PL9,PL92,PL922,żuromiński
370,PL9,PL92,PL921,zwoleński
371,PL9,PL92,PL926,żyrardowski


We obtained a dataset of all towns and villages in Poland with their associated municipalities, counties and regions.
This dataset will help us map:

"Town" & "Region" from the location_pl dataframe to the associated "County".
With that in place, the next step maps each "County" to its NUTS3 code.
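
The two-step lookup described above can be sketched on toy data. The frames `posts`, `gazetteer` and `nuts` are hypothetical stand-ins for location_pl, the town list and mapping_df; the single row is copied from tables shown in this notebook:

```python
import pandas as pd

# Toy stand-ins for location_pl, the town list and mapping_df
posts = pd.DataFrame({"Town": ["Łanięta"], "Region": ["Łódź Voivodeship"]})
gazetteer = pd.DataFrame({
    "Town": ["Łanięta"],
    "County": ["kutnowski"],
    "Region": ["Łódź Voivodeship"],
})
nuts = pd.DataFrame({"County": ["kutnowski"], "NUTS3 CODE": ["PL715"]})

# Step 1: (Town, Region) -> County
step1 = posts.merge(gazetteer, on=["Town", "Region"], how="left")

# Step 2: County -> NUTS3 code
step2 = step1.merge(nuts, on="County", how="left")
print(step2)
```

Step 2 assumes that a county name identifies a single county. Some Polish county names occur in more than one voivodeship, so it may be worth checking the mapping table for repeated "County" values before relying on a name-only join.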

In [19]:
# Let's import the dataset
poland_location = 'Downloads/Location_list_Poland.xlsx'

# Read the Excel file into a pandas DataFrame
poland_df = pd.read_excel(poland_location)

# Save the unique county names to a separate file for reference
unique_regions = poland_df['County'].unique()
df = pd.DataFrame(unique_regions)
df.to_excel('Downloads/counties.xlsx')
poland_df

Unnamed: 0,Town,Municipaty,County,Region
0,Abisynia,Kcynia,nakielski,kujawsko-pomorskie
1,Abisynia,Drzycim,świecki,kujawsko-pomorskie
2,Abisynia,Leśna Podlaska,bialski,lubelskie
3,Abisynia,Hrubieszów,hrubieszowski,lubelskie
4,Abisynia,Karsin,kościerski,pomorskie
...,...,...,...,...
102870,Żyznów,Strzyżów,strzyżowski,podkarpackie
102871,Żyznów,Klimontów,sandomierski,świętokrzyskie
102872,Żyznówka,Trzciana,bocheński,małopolskie
102873,Żyznówka,Rabka-Zdrój,nowotarski,małopolskie


The values in the "Region" column are in Polish, while the values in the location_df table are in English.
We should translate those names from Polish to English so that the two tables can be joined.

In [21]:
value_mapping = {
    'małopolskie': 'Lesser Poland',
    'mazowieckie': 'Mazovia',
    'łódzkie': 'Łódź Voivodeship',
    'lubelskie': 'Lublin',
    'wielkopolskie': 'Greater Poland',
    'podkarpackie': 'Subcarpathian',
    'świętokrzyskie': 'Świętokrzyskie',
    'kujawsko-pomorskie': 'Kujawsko-Pomorskie',
    'podlaskie': 'Podlasie',
    'śląskie': 'Silesia',
    'warmińsko-mazurskie': 'Warmia-Masuria',
    'pomorskie': 'Pomerania',
    'zachodniopomorskie': 'West Pomerania',
    'dolnośląskie': 'Lower Silesia',
    'opolskie': 'Opole Voivodeship',
    'lubuskie': 'Lubusz'
}

# Replace the values in the "Region" column using the mapping
poland_df['Region'] = poland_df['Region'].replace(value_mapping)
poland_df

Unnamed: 0,Town,Municipaty,County,Region
0,Abisynia,Kcynia,nakielski,Kujawsko-Pomorskie
1,Abisynia,Drzycim,świecki,Kujawsko-Pomorskie
2,Abisynia,Leśna Podlaska,bialski,Lublin
3,Abisynia,Hrubieszów,hrubieszowski,Lublin
4,Abisynia,Karsin,kościerski,Pomerania
...,...,...,...,...
102870,Żyznów,Strzyżów,strzyżowski,Subcarpathian
102871,Żyznów,Klimontów,sandomierski,Świętokrzyskie
102872,Żyznówka,Trzciana,bocheński,Lesser Poland
102873,Żyznówka,Rabka-Zdrój,nowotarski,Lesser Poland
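
`Series.replace` leaves any value without a matching key untouched and does not raise an error, so a key that differs from the data only in letter case fails silently. A small sketch of a coverage check, using made-up `demo_regions` and `demo_mapping` rather than the real variables:

```python
# Region names as they might appear in the town list (subset, for illustration)
demo_regions = ["kujawsko-pomorskie", "lubelskie", "świętokrzyskie"]

# Deliberately mixed-case keys to show how a silent mismatch arises
demo_mapping = {
    "lubelskie": "Lublin",
    "Kujawsko-Pomorskie": "Kujawsko-Pomorskie",
}

# Exact comparison: every region without an identical key is reported
unmapped = sorted(set(demo_regions) - set(demo_mapping))
print(unmapped)

# Comparing lower-cased keys makes the check case-insensitive
lowered = {key.lower(): value for key, value in demo_mapping.items()}
still_unmapped = sorted(r for r in demo_regions if r.lower() not in lowered)
print(still_unmapped)
```

Applied to the notebook, the same check would compare `poland_df['Region'].unique()` against the keys of `value_mapping` before calling `replace`.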


Now let's merge the tables on Town and Region to find out in which county each location from "location_pl" lies.

In [108]:
merge_df = location_pl.merge(poland_df[["County","Town","Region"]], on=['Town', 'Region'], how='left')
merge_df

Unnamed: 0,Town,Region,Country,County
0,Żychlin,Łódź Voivodeship,PL,kutnowski
1,Żychlin,Łódź Voivodeship,PL,piotrkowski
2,Łódź,Łódź Voivodeship,PL,Łódź
3,Łanięta,Łódź Voivodeship,PL,kutnowski
4,Łęczyca,Łódź Voivodeship,PL,bełchatowski
...,...,...,...,...
4119,Zieleniewo,West Pomerania,PL,koszaliński
4120,Zieleniewo,West Pomerania,PL,stargardzki
4121,Zielin,West Pomerania,PL,gryficki
4122,Zielin,West Pomerania,PL,gryfiński


In [25]:
merge_df['County'].isna().sum()

1432
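
An alternative way to count unmatched rows is the `indicator` option of `DataFrame.merge`, which adds a `_merge` column that says where each row came from. A minimal sketch on toy frames (`left`, `right`, `checked` and `unmatched` are illustrative names only):

```python
import pandas as pd

left = pd.DataFrame({
    "Town": ["Łanięta", "Adamow"],
    "Region": ["Łódź Voivodeship", "Łódź Voivodeship"],
})
right = pd.DataFrame({
    "Town": ["Łanięta", "Adamów"],
    "Region": ["Łódź Voivodeship", "Łódź Voivodeship"],
    "County": ["kutnowski", "bełchatowski"],
})

checked = left.merge(right, on=["Town", "Region"], how="left", indicator=True)

# "_merge" is "both" for matched rows and "left_only" for unmatched ones
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "of", len(left), "locations unmatched")
```

Unlike a check on missing "County" values, this also works when the right-hand table itself contains empty cells.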

In [112]:
merge_df = merge_df.dropna()
merge_df

Unnamed: 0,Town,Region,Country,County
0,Żychlin,Łódź Voivodeship,PL,kutnowski
1,Żychlin,Łódź Voivodeship,PL,piotrkowski
2,Łódź,Łódź Voivodeship,PL,Łódź
3,Łanięta,Łódź Voivodeship,PL,kutnowski
4,Łęczyca,Łódź Voivodeship,PL,bełchatowski
...,...,...,...,...
4118,Zieleniewo,West Pomerania,PL,kołobrzeski
4119,Zieleniewo,West Pomerania,PL,koszaliński
4120,Zieleniewo,West Pomerania,PL,stargardzki
4121,Zielin,West Pomerania,PL,gryficki


There are still 1432 locations that could not be mapped. Let's dig into why the matching failed for them.

In [27]:
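# merge_df no longer contains the unmatched rows (they were dropped above), so redo the left merge to recover them
unmatched = location_pl.merge(poland_df[["County","Town","Region"]], on=['Town', 'Region'], how='left')
nan_df = unmatched[unmatched['County'].isna()]
nan_df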

Unnamed: 0,Town,Region,Country,County
7,Adamow,Łódź Voivodeship,PL,
9,Aleksandrow,Łódź Voivodeship,PL,
11,Antoninow,Łódź Voivodeship,PL,
17,Baluty,Łódź Voivodeship,PL,
22,Bedzelin,Łódź Voivodeship,PL,
...,...,...,...,...
4108,Uniescie,West Pomerania,PL,
4110,Walcz,West Pomerania,PL,
4112,Wegorzyno,West Pomerania,PL,
4115,Wolczkowo,West Pomerania,PL,


In [28]:
poland_df[poland_df["Town"] == "Adamow"]

Unnamed: 0,Town,Municipaty,County,Region


In [29]:
poland_df[poland_df["Town"] == "Adamów"]

Unnamed: 0,Town,Municipaty,County,Region
88,Adamów,Rejowiec,chełmski,Lublin
89,Adamów,Cyców,łęczyński,Lublin
90,Adamów,Adamów,łukowski,Lublin
91,Adamów,Adamów,zamojski,Lublin
92,Adamów,Bełchatów,bełchatowski,Łódź Voivodeship
93,Adamów,Kleszczów,bełchatowski,Łódź Voivodeship
94,Adamów,Brzeziny,brzeziński,Łódź Voivodeship
95,Adamów,Bedlno,kutnowski,Łódź Voivodeship
96,Adamów,Kutno,kutnowski,Łódź Voivodeship
97,Adamów,Oporów,kutnowski,Łódź Voivodeship


Apparently, in the original source dataframe some poster locations are written with Polish diacritics and some without them, which is why the matching failed.

We should try again with a version of the town list that does not contain Polish diacritics and check the matching rate.
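
The notebook uses a second, pre-converted copy of the town list for this. As an alternative, not what is done here, the diacritics could be stripped in code so that both tables are compared on the same ASCII key. A sketch, assuming the hypothetical helper name `strip_polish_diacritics`:

```python
import unicodedata

# "ł" and "Ł" have no Unicode decomposition, so they are mapped explicitly
_STROKE = str.maketrans({"ł": "l", "Ł": "L"})


def strip_polish_diacritics(name: str) -> str:
    """Return the name with Polish diacritics replaced by plain ASCII letters."""
    decomposed = unicodedata.normalize("NFKD", name.translate(_STROKE))
    # Drop the combining marks left behind by the decomposition
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for town in ["Adamów", "Łódź", "Wałcz", "Węgorzyno"]:
    print(town, "->", strip_polish_diacritics(town))
```

With such a helper, a key column could be added to both dataframes, for example `df["Town_key"] = df["Town"].map(strip_polish_diacritics)`, and the merge done on that key instead of on the raw name.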

In [31]:
# Let's import the dataset
poland_location_2 = 'Downloads/Location_list_without_signs.xlsx'

# Read the Excel file into a pandas DataFrame
poland_signs = pd.read_excel(poland_location_2)
poland_signs

Unnamed: 0,Town,Municipaty,County,Region
0,Chaciaki,Zywiec,żywiecki,śląskie
1,Do Bialkow,Zywiec,żywiecki,śląskie
2,Do Blachuciakow,Zywiec,żywiecki,śląskie
3,Do Cygoniow,Zywiec,żywiecki,śląskie
4,Do Cyrnali,Zywiec,żywiecki,śląskie
...,...,...,...,...
102870,Stara Wies,Abramow,lubartowski,lubelskie
102871,Wielkie,Abramow,lubartowski,lubelskie
102872,Wielkolas,Abramow,lubartowski,lubelskie
102873,Wolica,Abramow,lubartowski,lubelskie


In [35]:
value_mapping = {
    'małopolskie': 'Lesser Poland',
    'mazowieckie': 'Mazovia',
    'łódzkie': 'Łódź Voivodeship',
    'lubelskie': 'Lublin',
    'wielkopolskie': 'Greater Poland',
    'podkarpackie': 'Subcarpathian',
    'świętokrzyskie': 'Świętokrzyskie',
    'kujawsko-pomorskie': 'Kujawsko-Pomorskie',
    'podlaskie': 'Podlasie',
    'śląskie': 'Silesia',
    'warmińsko-mazurskie': 'Warmia-Masuria',
    'pomorskie': 'Pomerania',
    'zachodniopomorskie': 'West Pomerania',
    'dolnośląskie': 'Lower Silesia',
    'opolskie': 'Opole Voivodeship',
    'lubuskie': 'Lubusz'
}

# Replace the values in the "Region" column using the mapping
poland_signs['Region'] = poland_signs['Region'].replace(value_mapping)
poland_signs

Unnamed: 0,Town,Municipaty,County,Region
0,Chaciaki,Zywiec,żywiecki,Silesia
1,Do Bialkow,Zywiec,żywiecki,Silesia
2,Do Blachuciakow,Zywiec,żywiecki,Silesia
3,Do Cygoniow,Zywiec,żywiecki,Silesia
4,Do Cyrnali,Zywiec,żywiecki,Silesia
...,...,...,...,...
102870,Stara Wies,Abramow,lubartowski,Lublin
102871,Wielkie,Abramow,lubartowski,Lublin
102872,Wielkolas,Abramow,lubartowski,Lublin
102873,Wolica,Abramow,lubartowski,Lublin


We will repeat the exercise with the remaining locations.

In [32]:
nan_df = nan_df.drop(columns = "County")
nan_df

Unnamed: 0,Town,Region,Country
7,Adamow,Łódź Voivodeship,PL
9,Aleksandrow,Łódź Voivodeship,PL
11,Antoninow,Łódź Voivodeship,PL
17,Baluty,Łódź Voivodeship,PL
22,Bedzelin,Łódź Voivodeship,PL
...,...,...,...
4108,Uniescie,West Pomerania,PL
4110,Walcz,West Pomerania,PL
4112,Wegorzyno,West Pomerania,PL
4115,Wolczkowo,West Pomerania,PL


In [36]:
second_merge = nan_df.merge(poland_signs[["County","Town","Region"]], on=['Town', 'Region'], how='left')
second_merge

Unnamed: 0,Town,Region,Country,County
0,Adamow,Łódź Voivodeship,PL,opoczyński
1,Adamow,Łódź Voivodeship,PL,piotrkowski
2,Adamow,Łódź Voivodeship,PL,pajęczański
3,Adamow,Łódź Voivodeship,PL,poddębicki
4,Adamow,Łódź Voivodeship,PL,opoczyński
...,...,...,...,...
1968,Uniescie,West Pomerania,PL,koszaliński
1969,Walcz,West Pomerania,PL,wałecki
1970,Wegorzyno,West Pomerania,PL,łobeski
1971,Wolczkowo,West Pomerania,PL,policki


In [37]:
second_merge['County'].isna().sum()

292

In [140]:
second_merge = second_merge.dropna()
second_merge

Unnamed: 0,Town,Region,Country,County
0,Adamow,Łódź Voivodeship,PL,opoczyński
1,Adamow,Łódź Voivodeship,PL,piotrkowski
2,Adamow,Łódź Voivodeship,PL,pajęczański
3,Adamow,Łódź Voivodeship,PL,poddębicki
4,Adamow,Łódź Voivodeship,PL,opoczyński
...,...,...,...,...
1968,Uniescie,West Pomerania,PL,koszaliński
1969,Walcz,West Pomerania,PL,wałecki
1970,Wegorzyno,West Pomerania,PL,łobeski
1971,Wolczkowo,West Pomerania,PL,policki


In [56]:
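# second_merge no longer contains the unmatched rows (they were dropped above), so redo the left merge to recover them
unmatched_2nd = nan_df.merge(poland_signs[["County","Town","Region"]], on=['Town', 'Region'], how='left')
nan_2nd = unmatched_2nd[unmatched_2nd['County'].isna()]
nan_2nd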

Unnamed: 0,Town,Region,Country,County
64,Gadka Stara,Łódź Voivodeship,PL,
68,Gmina Błaszki,Łódź Voivodeship,PL,
69,Gmina Lutomiersk,Łódź Voivodeship,PL,
70,Gmina Lututów,Łódź Voivodeship,PL,
71,Gmina Moszczenica,Łódź Voivodeship,PL,
...,...,...,...,...
1935,Gmina Kalisz Pomorski,West Pomerania,PL,
1942,Kliniska,West Pomerania,PL,
1947,Mierzyn k. Szczecina,West Pomerania,PL,
1961,Stargard,West Pomerania,PL,


After the second merge we were able to map most of the locations.
This time, however, most of the remaining locations are misspelled:

- part of the name is missing (we should see "Kliniska Wielkie" instead of "Kliniska")
- the name contains extra words (we should see "Mierzyn" instead of "Mierzyn k. Szczecina")
- many of them carry the prefix "Gmina" before the name of the location. "Gmina" is the equivalent of "Municipaty" and should not be part of the name (we should see "Lutomiersk" instead of "Gmina Lutomiersk")
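
Two of these patterns are regular enough to fix with a rule. The sketch below strips the "Gmina" prefix and a trailing "k. ..." qualifier ("k." abbreviates "koło", meaning "near"). The helper name `clean_town_name` is hypothetical, and names with a missing part, such as "Kliniska", still need manual correction:

```python
import re


def clean_town_name(name: str) -> str:
    """Remove the 'Gmina' prefix and a trailing 'k. <place>' qualifier."""
    # Drop a leading "Gmina " (municipality) prefix
    name = re.sub(r"^Gmina\s+", "", name.strip())
    # Drop a trailing "k. <other place>" qualifier
    name = re.sub(r"\s+k\.\s+.*$", "", name)
    return name


for raw in ["Gmina Lutomiersk", "Mierzyn k. Szczecina", "Gadka Stara"]:
    print(raw, "->", clean_town_name(raw))
```

The function could be applied to a whole column with `nan_2nd['Town'].map(clean_town_name)`; the next cell handles only the "Gmina" prefix.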



In [163]:
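# Strip the "Gmina " prefix; assign() returns a new frame and avoids pandas' SettingWithCopyWarning
nan_2nd = nan_2nd.assign(Town=nan_2nd['Town'].str.replace('Gmina ', '', regex=False))
nan_2nd = nan_2nd.drop(columns = "County")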

In [164]:
nan_2nd

Unnamed: 0,Town,Region,Country
64,Gadka Stara,Łódź Voivodeship,PL
68,Błaszki,Łódź Voivodeship,PL
69,Lutomiersk,Łódź Voivodeship,PL
70,Lututów,Łódź Voivodeship,PL
71,Moszczenica,Łódź Voivodeship,PL
...,...,...,...
1935,Kalisz Pomorski,West Pomerania,PL
1942,Kliniska,West Pomerania,PL
1947,Mierzyn k. Szczecina,West Pomerania,PL
1961,Stargard,West Pomerania,PL


In [166]:
with_pl = nan_2nd.merge(poland_df[["County","Town","Region"]], on=['Town', 'Region'], how='left')
with_pl

Unnamed: 0,Town,Region,Country,County
0,Gadka Stara,Łódź Voivodeship,PL,
1,Błaszki,Łódź Voivodeship,PL,sieradzki
2,Lutomiersk,Łódź Voivodeship,PL,pabianicki
3,Lututów,Łódź Voivodeship,PL,wieruszowski
4,Moszczenica,Łódź Voivodeship,PL,piotrkowski
...,...,...,...,...
335,Kalisz Pomorski,West Pomerania,PL,drawski
336,Kliniska,West Pomerania,PL,
337,Mierzyn k. Szczecina,West Pomerania,PL,
338,Stargard,West Pomerania,PL,


In [167]:
without_pl = nan_2nd.merge(poland_signs[["County","Town","Region"]], on=['Town', 'Region'], how='left')
without_pl 

Unnamed: 0,Town,Region,Country,County
0,Gadka Stara,Łódź Voivodeship,PL,
1,Błaszki,Łódź Voivodeship,PL,
2,Lutomiersk,Łódź Voivodeship,PL,pabianicki
3,Lututów,Łódź Voivodeship,PL,
4,Moszczenica,Łódź Voivodeship,PL,zgierski
...,...,...,...,...
322,Kalisz Pomorski,West Pomerania,PL,drawski
323,Kliniska,West Pomerania,PL,
324,Mierzyn k. Szczecina,West Pomerania,PL,
325,Stargard,West Pomerania,PL,


Let's now concatenate all the dataframes that we managed to map with the different methods.
Then we will join the result with the NUTS mapping dataframe and delete duplicates.
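
The sequence of stacking, joining and de-duplicating can be sketched on toy data. `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the sketch uses `pd.concat`. The frames `first`, `second` and `nuts` are illustrative, with rows copied from tables in this notebook:

```python
import pandas as pd

first = pd.DataFrame({"Town": ["Łanięta"], "County": ["kutnowski"]})
second = pd.DataFrame({
    "Town": ["Łanięta", "Kalisz Pomorski"],
    "County": ["kutnowski", "drawski"],
})
nuts = pd.DataFrame({
    "County": ["kutnowski", "drawski"],
    "NUTS3 CODE": ["PL715", "PL427"],
})

# Stack the partial results on top of each other
stacked = pd.concat([first, second], ignore_index=True)

# Attach the NUTS3 code of each county
coded = stacked.merge(nuts, on="County", how="left")

# Keep one row per (Town, NUTS3 code) pair
unique = coded.drop_duplicates(subset=["Town", "NUTS3 CODE"]).reset_index(drop=True)
print(unique)
```

Note that de-duplicating on "Town" and "NUTS3 CODE" treats all rows with a missing code for the same town name as duplicates of each other, regardless of their Region.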

In [176]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
combined_df = pd.concat([merge_df, second_merge, with_pl, without_pl], ignore_index=True)


In [177]:
combined_df

Unnamed: 0,Town,Region,Country,County
0,Żychlin,Łódź Voivodeship,PL,kutnowski
1,Żychlin,Łódź Voivodeship,PL,piotrkowski
2,Łódź,Łódź Voivodeship,PL,Łódź
3,Łanięta,Łódź Voivodeship,PL,kutnowski
4,Łęczyca,Łódź Voivodeship,PL,bełchatowski
...,...,...,...,...
5035,Kalisz Pomorski,West Pomerania,PL,drawski
5036,Kliniska,West Pomerania,PL,
5037,Mierzyn k. Szczecina,West Pomerania,PL,
5038,Stargard,West Pomerania,PL,


In [178]:
combined_merged = combined_df.merge(mapping_df[["County","NUTS3 CODE"]], on=['County'], how='left') 

In [179]:
mapping_df[mapping_df["County"] == "Kraków"]

Unnamed: 0,NUTS1 CODE,NUTS2 CODE,NUTS3 CODE,County
120,PL2,PL21,PL213,Kraków


In [180]:
combined_merged

Unnamed: 0,Town,Region,Country,County,NUTS3 CODE
0,Żychlin,Łódź Voivodeship,PL,kutnowski,PL715
1,Żychlin,Łódź Voivodeship,PL,piotrkowski,PL713
2,Łódź,Łódź Voivodeship,PL,Łódź,PL711
3,Łanięta,Łódź Voivodeship,PL,kutnowski,PL715
4,Łęczyca,Łódź Voivodeship,PL,bełchatowski,PL713
...,...,...,...,...,...
5060,Kalisz Pomorski,West Pomerania,PL,drawski,PL427
5061,Kliniska,West Pomerania,PL,,
5062,Mierzyn k. Szczecina,West Pomerania,PL,,
5063,Stargard,West Pomerania,PL,,


In [181]:
df_no_duplicates = combined_merged.drop_duplicates(subset=['Town', 'NUTS3 CODE'])
df_no_duplicates

Unnamed: 0,Town,Region,Country,County,NUTS3 CODE
0,Żychlin,Łódź Voivodeship,PL,kutnowski,PL715
1,Żychlin,Łódź Voivodeship,PL,piotrkowski,PL713
2,Łódź,Łódź Voivodeship,PL,Łódź,PL711
3,Łanięta,Łódź Voivodeship,PL,kutnowski,PL715
4,Łęczyca,Łódź Voivodeship,PL,bełchatowski,PL713
...,...,...,...,...,...
4995,Puszcza Mariańska,Mazovia,PL,,
5012,Pawłowiczki,Opole Voivodeship,PL,,
5028,Główczyce,Pomerania,PL,,
5029,Potęgowo,Pomerania,PL,,


In [184]:
# Count the rows that still have no NUTS3 code
final_merge = df_no_duplicates['NUTS3 CODE'].isna().sum()
final_merge

261

In [205]:
nan_3rd = df_no_duplicates[df_no_duplicates.isna().any(axis=1)]
nan_3rd

Unnamed: 0,Town,Region,Country,County,NUTS3 CODE
4398,Gadka Stara,Łódź Voivodeship,PL,,
4413,Kolonia Gorka Klonowska,Łódź Voivodeship,PL,,
4414,Konstantynow Lodzki,Łódź Voivodeship,PL,,
4415,Oddzial,Łódź Voivodeship,PL,,
4416,Sucha Stara,Łódź Voivodeship,PL,,
...,...,...,...,...,...
4995,Puszcza Mariańska,Mazovia,PL,,
5012,Pawłowiczki,Opole Voivodeship,PL,,
5028,Główczyce,Pomerania,PL,,
5029,Potęgowo,Pomerania,PL,,


In [206]:
nan_3rd = nan_3rd.drop(columns = ["County", "NUTS3 CODE"])

In [207]:
nan_3rd

Unnamed: 0,Town,Region,Country
4398,Gadka Stara,Łódź Voivodeship,PL
4413,Kolonia Gorka Klonowska,Łódź Voivodeship,PL
4414,Konstantynow Lodzki,Łódź Voivodeship,PL
4415,Oddzial,Łódź Voivodeship,PL
4416,Sucha Stara,Łódź Voivodeship,PL
...,...,...,...
4995,Puszcza Mariańska,Mazovia,PL
5012,Pawłowiczki,Opole Voivodeship,PL
5028,Główczyce,Pomerania,PL
5029,Potęgowo,Pomerania,PL


In [208]:
nan_3rd = nan_3rd.merge(poland_df[["County", "Town", "Region"]], on = (["Town", "Region"]), how = "left")

In [209]:
nan_3rd

Unnamed: 0,Town,Region,Country,County
0,Gadka Stara,Łódź Voivodeship,PL,
1,Kolonia Gorka Klonowska,Łódź Voivodeship,PL,
2,Konstantynow Lodzki,Łódź Voivodeship,PL,
3,Oddzial,Łódź Voivodeship,PL,
4,Sucha Stara,Łódź Voivodeship,PL,
...,...,...,...,...
269,Pawłowiczki,Opole Voivodeship,PL,kędzierzyńsko-kozielski
270,Główczyce,Pomerania,PL,słupski
271,Potęgowo,Pomerania,PL,słupski
272,Potęgowo,Pomerania,PL,wejherowski


In [210]:
nan_3rd = nan_3rd.merge(mapping_df[["County", "NUTS3 CODE"]], on = (["County"]), how = "left")
filtered_left = nan_3rd[nan_3rd.isna().any(axis=1)]
new = nan_3rd.dropna()

new

Unnamed: 0,Town,Region,Country,County,NUTS3 CODE
209,Błaszki,Łódź Voivodeship,PL,sieradzki,PL714
210,Lututów,Łódź Voivodeship,PL,wieruszowski,PL714
211,Osjaków,Łódź Voivodeship,PL,wieluński,PL714
212,Parzęczew,Łódź Voivodeship,PL,zgierski,PL712
213,Piątek,Łódź Voivodeship,PL,łęczycki,PL715
...,...,...,...,...,...
269,Pawłowiczki,Opole Voivodeship,PL,kędzierzyńsko-kozielski,PL524
270,Główczyce,Pomerania,PL,słupski,PL636
271,Potęgowo,Pomerania,PL,słupski,PL636
272,Potęgowo,Pomerania,PL,wejherowski,PL634


In [211]:
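# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
# df_no_duplicates appears twice in the list; the repeated rows are removed by drop_duplicates in the next cell.
final = pd.concat([df_no_duplicates, df_no_duplicates, new], ignore_index=True)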


In [212]:
final

Unnamed: 0,Town,Region,Country,County,NUTS3 CODE
0,Żychlin,Łódź Voivodeship,PL,kutnowski,PL715
1,Żychlin,Łódź Voivodeship,PL,piotrkowski,PL713
2,Łódź,Łódź Voivodeship,PL,Łódź,PL711
3,Łanięta,Łódź Voivodeship,PL,kutnowski,PL715
4,Łęczyca,Łódź Voivodeship,PL,bełchatowski,PL713
...,...,...,...,...,...
7910,Pawłowiczki,Opole Voivodeship,PL,kędzierzyńsko-kozielski,PL524
7911,Główczyce,Pomerania,PL,słupski,PL636
7912,Potęgowo,Pomerania,PL,słupski,PL636
7913,Potęgowo,Pomerania,PL,wejherowski,PL634


In [215]:
final_no_duplicates = final.drop_duplicates(subset=['Town', 'NUTS3 CODE'])
final_no_duplicates = final_no_duplicates.dropna()
final_no_duplicates

Unnamed: 0,Town,Region,Country,County,NUTS3 CODE
0,Żychlin,Łódź Voivodeship,PL,kutnowski,PL715
1,Żychlin,Łódź Voivodeship,PL,piotrkowski,PL713
2,Łódź,Łódź Voivodeship,PL,Łódź,PL711
3,Łanięta,Łódź Voivodeship,PL,kutnowski,PL715
4,Łęczyca,Łódź Voivodeship,PL,bełchatowski,PL713
...,...,...,...,...,...
3863,Purda,Warmia-Masuria,PL,olsztyński,PL622
3865,Banie,West Pomerania,PL,gryfiński,PL428
3866,Bielice,West Pomerania,PL,drawski,PL427
3867,Bielice,West Pomerania,PL,goleniowski,PL428


In [216]:
final_no_duplicates.to_excel("Downloads/mapping_final.xlsx")

In [218]:
filtered_left

Unnamed: 0,Town,Region,Country,County,NUTS3 CODE
0,Gadka Stara,Łódź Voivodeship,PL,,
1,Kolonia Gorka Klonowska,Łódź Voivodeship,PL,,
2,Konstantynow Lodzki,Łódź Voivodeship,PL,,
3,Oddzial,Łódź Voivodeship,PL,,
4,Sucha Stara,Łódź Voivodeship,PL,,
...,...,...,...,...,...
204,Ilowo,Warmia-Masuria,PL,,
205,Kliniska,West Pomerania,PL,,
206,Mierzyn k. Szczecina,West Pomerania,PL,,
207,Stargard,West Pomerania,PL,,


In [219]:
filtered_left.to_excel("Downloads/remained.xlsx")