This notebook branches off from CatalogueLife_Distribution.ipynb. It cleans the distribution file downloaded from the Catalogue of Life, "locationdescription.txt", and then maps each place to a geopolitical region. The cleaned dataset will be joined with other Catalogue of Life datasets to supplement the downloaded GenBank records.

For documentation about what choices were made and why locations were cleaned, please refer to CatalogueLife_documentation.ipynb.

In [1]:
import re
import pandas as pd

In [2]:
distribution = "poly_locationdescription.txt"
outfile = "poly_distribution.csv"

In [3]:
# importing file (the context manager closes it automatically)
with open(distribution, "r") as infile:
    location = infile.read()
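The first ID in the printed output further down is preceded by three stray characters, which is what a UTF-8 byte-order mark (BOM) looks like when the file is decoded with a single-byte encoding. A minimal sketch of reading such a file with the `utf-8-sig` codec follows, assuming the source file really is UTF-8 with a BOM. The sample bytes and temporary file are hypothetical. If this were adopted, the `[3:]` slice used later to drop those characters would no longer be needed.

```python
import os
import tempfile

# Hypothetical sample: two tab-separated records preceded by a UTF-8 BOM.
sample = b"\xef\xbb\xbf45194626\tNew Guinea\n45194627\tEcuador\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as infile:
        text = infile.read()
finally:
    os.remove(path)

# The first field is now the bare ID, with no stray leading characters.
first_id = text.split("\t", 1)[0]
print(first_id)
```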

In [4]:
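# General cleaning using regex (raw strings avoid invalid-escape warnings)
output = re.sub(r'\(.+?\)', '', location)      # drop parenthesised notes
output2 = re.sub(r'\[.+?\]', '', output)       # drop bracketed notes
output3 = re.sub(r'\S+\-', '', output2)        # drop each token up to and including its last hyphen
output4 = re.sub(r'\?', '', output3)           # drop question marks
output5 = re.sub(r'peninsular', '', output4)   # drop the word "peninsular"
output6 = re.sub(r'\)', '', output5)           # drop stray closing parentheses

# NOTE: the "." is unescaped, so this matches "Isl" plus ANY character.
# "Isl." becomes "Island", but "Isla"/"Isle" also become "Island" and
# "Islands" becomes "Islandnds". Later replacements and geo_dict entries
# (e.g. "Island del Coco") rely on this behaviour, so it is left unchanged.
output7 = re.sub(r'Isl.', 'Island', output6)

output8 = re.sub(r'\+', '', output7)           # drop plus signs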
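To make the effect of these substitutions concrete, here is a small sketch that applies the same patterns, in the same order, to a made-up sample string. The helper name `general_clean` and the sample are illustrative only and do not appear in the source data.

```python
import re

def general_clean(text):
    """Apply the same substitutions as the cell above, in the same order."""
    steps = [
        (r'\(.+?\)', ''),      # parenthesised notes
        (r'\[.+?\]', ''),      # bracketed notes
        (r'\S+\-', ''),        # token prefix ending in a hyphen
        (r'\?', ''),           # question marks
        (r'peninsular', ''),
        (r'\)', ''),           # stray closing parentheses
        (r'Isl.', 'Island'),   # "." is unescaped: matches any character
        (r'\+', ''),
    ]
    for pattern, replacement in steps:
        text = re.sub(pattern, replacement, text)
    return text

# Hypothetical input, not taken from the Catalogue of Life file.
sample = "Solomon Isl. (Guadalcanal), ?Fiji, Guinea-Bissau, Channel Islands"
cleaned = general_clean(sample)
print(cleaned)
# prints: Solomon Island , Fiji, Bissau, Channel Islandnds
```

The last item shows why a later cell has to repair "Channel Islandnds": the pattern `Isl.` also matches the "Isla" inside "Islands".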

Next, extra cleaning procedures were implemented to help match geopolitical regions to countries or more granular places appearing in the Catalogue of Life authority for each species. See the documentation for further information.

In [5]:
#Extra cleaning to help match geopolitical regions to countries
#commas missing between countries in source files
output9 = output8.replace("Japan  India", "Japan, India").replace("Russian Far East  North Korea", "Russian Far East, North Korea").replace("Panama Cuba", "Panama, Cuba").replace("Colombia Bolivia", "Colombia, Bolivia").replace("Brazil  Brazil", "Brazil").replace("South Korea  Ryukyu Island", "South Korea, Ryukyu Island").replace("Rodrigues Island India", "Rodrigues Island, India").replace("Kermadec Island Chatham Island", "Kermadec Island, Chatham Island")

#special characters not removed
output10 = output9.replace("& Thailand", "Thailand").replace("& Sumatra", "Sumatra").replace("#India", "India").replace("& European Russia", "European Russia").replace("& Taiwan", "Taiwan")

#odd splitting
output11 = output10.replace("and  Thailand", "Thailand").replace("and Taiwan", "Taiwan").replace("China (Guangdong", "China").replace("India (Arunachal Pradesh", "India").replace("Comoros (Anjouan", "Comoros").replace("and Afghanistan", "Afghanistan").replace("China (Gansu", "China").replace("Canary Is.", "Canary Islands")

#short phrases
output12 = output11.replace("Sumatra to Sumatra", "Sumatra").replace("described from cultivated material in Holland", "Holland").replace("possibly from Ecuador", "Ecuador").replace("Vanuatu {this includes all Indian and Indochina populations of A. unilaterale fide Jenkins et al. . Whether the Australasian and Pacific records belong here", "Vanuatu").replace("to be expected in Bolivia", "Bolivia").replace("to be expected in the Guianas", "Guianas").replace("Norway to Hungary and Dalmatia", "Norway, Hungary, Dalmatia").replace("Malesia to New Guinea", "Malesia, New Guinea").replace("Nicaragua to Colombia", "Nicaragua, Colombia").replace("ldenburg River to Morobe District", "Morobe District")

#incorrect spelling
output13 = output12.replace("EI Salvador", "El Salvador").replace("Falkland Island", "Falkland Islands").replace("Zanmbia", "Zambia").replace("Channel Islandnds", "Channel Islands").replace("Faroer Island", "Faroe Islands").replace("Kuri Island", "Kuril Island").replace("Tawitawi", "Tawi-Tawi").replace("Sao Tom", "Sao Tome").replace("Azerbajian", "Azerbaijan").replace("Marshall Island", "Marshall Islands").replace("Sao Tomee", "Sao Tome").replace("Juan Fernndez Island", "Juan Fernandez Island").replace("Galega Island", "Agalega Island").replace("Mpulamanga", "Mpumalanga").replace("La Dsiderade", "La Desiderade").replace("Marie Galante", "Marie-galante")

#incorrect_id
# output13 = output13.replace("ï»¿45194626","45194626")

cleaned_data = output13
print(cleaned_data)

ï»¿45194626	New Guinea 
45194627	Ecuador
45194628	Bolivia , Ecuador, Peru
45194632	Mexico , Belize, Guatemala, Honduras, El Salvador, Nicaragua, Costa Rica, Panama, Colombia , Venezuela , Ecuador, Cuba, Jamaica, Hispaniola, Puerto Rico, St. Eustatius, St. Kitts, Montserrat, Guadeloupe, Dominica, Martinique, St. Lucia, St. Vincent, Trinidad, Tobago
45194645	India 
45194655	Madagascar
45194658	Australia , Solomon Island , Bonin Island , Volcano Island , New Guinea, Bismarck Arch. , Moluccas , Philippines, Sulawesi, Java, Lesser Sunda Island , Northern Marianas , Samoa, Vanuatu, Fiji
45194688	tropical Asia
45194692	Philippines , Borneo , Sulawesi, Lesser Sunda Island , Nepal
45194700	Taiwan, Philippines , Palawan
45194719	Ecuador
45194720	Fiji , Vanuatu, New Guinea, Moluccas , Solomon Island 
45194728	Guinea, Sierra Leone, Liberia, Ivory Coast, Ghana, Nigeria, Cameroon, Bioko Island , Gabon, Congo , D.R. Congo 
45194734	China , Taiwan, Myanmar , Thailand, Vietnam, Sumatra, Java, Lesser Su
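Chained `str.replace` calls match substrings, so a fix such as "Sao Tom" to "Sao Tome" also fires inside an already correct "Sao Tome" and needs a follow-up repair ("Sao Tomee"). One possible alternative, sketched here and not used by this notebook, is to anchor each fix with word boundaries. The dictionary below is a hypothetical subset of the spelling fixes, and the sample string is made up.

```python
import re

# Hypothetical subset of the spelling fixes, written as whole-word patterns.
SPELLING_FIXES = {
    r"\bSao Tom\b": "Sao Tome",
    r"\bMarshall Island\b": "Marshall Islands",
    r"\bZanmbia\b": "Zambia",
}

def fix_spelling(text):
    """Apply each whole-word fix once; correct spellings are left alone."""
    for pattern, replacement in SPELLING_FIXES.items():
        text = re.sub(pattern, replacement, text)
    return text

sample = "Sao Tom, Sao Tome, Marshall Island, Marshall Islands, Zanmbia"
fixed = fix_spelling(sample)
print(fixed)
# prints: Sao Tome, Sao Tome, Marshall Islands, Marshall Islands, Zambia
```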

The cleaned data was reformatted into a working string of `ID:'place', 'place'` lines and written to an intermediate text file.

In [6]:
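# Reformat cleaned_data into "ID:'place', 'place', ..." lines.
# The [3:] slice drops the three stray byte-order-mark characters at the start of the file.
working_data = cleaned_data[3:].replace("\t",":'").replace(",", "','").replace("',' ", "','").replace(" ','", "','")\
    .replace("','", "', '").replace(" \n", "\n").replace("\n", "'\n")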

In [7]:
mid_outfile = 'cleaned_' + distribution
# The context manager flushes and closes the file so it can be safely re-read below.
with open(mid_outfile, 'w') as file_out:
    print(working_data, file=file_out)

The cleaned text file was reimported and put into a dictionary, where __keys__ are the species IDs and __values__ are the distributions (countries and places).

In [8]:
# Creating distribution dictionary from cleaned .txt
mid_infile = mid_outfile

with open(mid_infile, 'r') as t:
    read_list = t.readlines()
# strip newlines and the quote characters added in the previous step
clean_replace = [i.replace('\n', '').replace("'", "") for i in read_list]
clean_replace

['45194626:New Guinea',
 '45194627:Ecuador',
 '45194628:Bolivia, Ecuador, Peru',
 '45194632:Mexico, Belize, Guatemala, Honduras, El Salvador, Nicaragua, Costa Rica, Panama, Colombia, Venezuela, Ecuador, Cuba, Jamaica, Hispaniola, Puerto Rico, St. Eustatius, St. Kitts, Montserrat, Guadeloupe, Dominica, Martinique, St. Lucia, St. Vincent, Trinidad, Tobago',
 '45194645:India',
 '45194655:Madagascar',
 '45194658:Australia, Solomon Island, Bonin Island, Volcano Island, New Guinea, Bismarck Arch., Moluccas, Philippines, Sulawesi, Java, Lesser Sunda Island, Northern Marianas, Samoa, Vanuatu, Fiji',
 '45194688:tropical Asia',
 '45194692:Philippines, Borneo, Sulawesi, Lesser Sunda Island, Nepal',
 '45194700:Taiwan, Philippines, Palawan',
 '45194719:Ecuador',
 '45194720:Fiji, Vanuatu, New Guinea, Moluccas, Solomon Island',
 '45194728:Guinea, Sierra Leone, Liberia, Ivory Coast, Ghana, Nigeria, Cameroon, Bioko Island, Gabon, Congo, D.R. Congo',
 '45194734:China, Taiwan, Myanmar, Thailand, Vietnam,

In [9]:
species = {}
for each in clean_replace:
    chunk = each.replace(":",",").split(",")
    key = chunk[0]
    value = chunk[1:]
    species[key] = value
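The same ID-to-places mapping can also be built straight from the tab-separated text, without the intermediate quoted file. The sketch below is an alternative under stated assumptions, not the notebook's method: it assumes every record is `ID<TAB>place, place, ...` and that any byte-order mark has already been removed. The function name `parse_distribution` and the sample text are hypothetical.

```python
def parse_distribution(cleaned_text):
    """Map each taxon ID to a list of stripped place names."""
    species_places = {}
    for line in cleaned_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split only on the first tab, so the ID never mixes with the places.
        taxon_id, _, places = line.partition("\t")
        species_places[taxon_id.strip()] = [p.strip() for p in places.split(",")]
    return species_places

# Hypothetical two-record sample in the same layout as the printed output above.
sample = "45194627\tEcuador\n45194628\tBolivia , Ecuador, Peru\n"
parsed = parse_distribution(sample)
print(parsed)
```

Because the ID is separated with `partition` and never by a colon, a place name that happens to contain ":" would not be split into two entries.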

Next, the geopolitical dictionary from the GenBank_toPandas.ipynb script was migrated over to convert distribution places into geopolitical regions. The dictionary below was modified to accommodate all places that did not automatically match a country. For more information about which places were added to the dictionary and the reasoning behind each decision, please see the documentation.

In [10]:
geo_dict = {"Northern Africa": {"Algeria", "Egypt", "Libya", "Morocco", "Sudan", "Tunisia", "Western Sahara", "Sinai peninsula",
                               "Malakal"},
                    "Eastern Africa": {"British Indian Ocean Territory", "Burundi", "Comoros", "Djibouti", "Eritrea",
                                       "Ethiopia", "French Southern Territories", "Kenya", "Madagascar", "Malawi",
                                       "Mauritius", "Mayotte", "Mozambique", "La Runion", "Rwanda", "Seychelles", "Somalia",
                                       "South Sudan", "Uganda", "United Republic of Tanzania", "Tanzania", "Zambia", "Zimbabwe", 
                                       "Reunion", "Pemba Island", "Limpopo", "Rodrigues", "Rodrigues Island", "Zanzibar", "Agalega Island"},
                    "Middle Africa": {"Angola", "Cameroon", "Central African Republic", "Chad", "Congo",
                                      "Democratic Republic of the Congo", "Equatorial Guinea", "Equatorial", "Guinea", "Gabon",
                                      "Sao Tome and Principe", "D.R.Congo", "Annobon Island", "Sao Tome", "Bioko Island", "Sao Tome",
                                     "Principe Island", "D.R. Congo"},
                    "Southern Africa": {"Botswana", "Eswatini", "Lesotho", "Namibia", "South Africa", "Gauteng", "Swaziland",
                                       "Marion Island", "Natal", "Mpumalanga"},
                    "Western Africa": {"Benin", "Burkina Faso", "Cabo Verde", "Côte d'Ivoire", "Gambia", "Ghana", "Guinea",
                                       "Guinea-Bissau", "Liberia", "Mali", "Mauritania", "Niger", "Nigeria", "Saint Helena",
                                       "Senegal", "Sierra Leone", "Togo", "Bissau", "Cape Verde Island", "Ivory Coast"},
                    "Caribbean": {"Anguilla", "Antigua", "Aruba", "Bahamas", "Barbados", "Bonaire",
                                  "British Virgin Islands", "Cayman Islands", "Cuba", "Curacao", "Dominica", "Dominican Republic",
                                  "Grenada", "Guadeloupe", "Haiti", "Jamaica", "Martinique", "Montserrat", "Puerto Rico",
                                  "Saint Barthélemy", "Saint Kitts & Nevis", "Saint Lucia", "Saint Martin",
                                  "Saint Vincent and the Grenadines", "Sint Maarten", "Saba", "Trinidad", "Tobago", "Trinidad and Tobago",
                                  "Turks & Caicos Islands", "Virgin Islands", "Mona Island", "Nevis", "St. Croix", "Hispaniola",
                                 "Greater Antilles", "Marie Galante Island", "Grenadines", "Turks & Caicos Island", "St. Kitts",
                                 "Anegada", "St. Barthelemy", "St. Thomas", "La Desirade", "Cayman Island", "La Desiderade", "Tortola",
                                 "U.S. Virgin Island", "Les Saintes Island", "Virgin Island", "Culebra Island", "Island Margarita",
                                 "St. Thomas Island", "Redonda", "Nevis Island", "St. Eustatius", "Virgin Gorda", "Lesser Antilles",
                                 "Vieques Island", "St. Lucia", "St. Martin", "Guana", "Marie-galante", "Marie-galante Island"},
                    "Central America": {"Belize", "Costa Rica", "Isla del Coco", "El Salvador", "Guatemala", "Honduras",
                                        "Mexico", "Nicaragua", "Panama", "Campeche", "Merida", "Island Guadalupe", "Queretaro",
                                       "Yucatan", "Distrito Federal", "Central America", "Morelos", "Revillagigedos Island", "Mexico State",
                                       "Guerrero", "Veracruz", "Puebla", "Oaxaca", "Chiapas", "Colima", "Quintana Roo", "Tamaulipas",
                                       "Island del Coco"},
                    "South America": {"Argentina", "Bolivia", "Bouvet Island", "Brazil", "Chile", "Colombia", "Ecuador",
                                      "Falkland Islands", "Falkland Islands (Islas Malvinas)", "French Guiana", "Guyana", "Paraguay",
                                      "Peru", "South Georgia", "South Sandwich Islands", "Suriname", "Uruguay", "Venezuela", "Bahia",
                                     "Galapagos Island", "Guarico", "Galapagos", "Entre Rios", "Sucre", "South America", "Surinam",
                                     "Rio Grande do Sul", "Tachira", "Catamarca", "Carabobo", "Yaracuy", "Putumayo", "Misiones",
                                     "Mato Grosso do Sul", "Espirito Santo", "Goias", "Norte de Santander", "Portuguesa", "Rio de Janeiro",
                                     "Barinas", "Chaco", "Cauca", "Jujuy", "Zulia", "Sao Paulo", "Nueva Esparta", "Cape Prov.",
                                     "Juan Fdz. Island", "Corrientes", "Formosa Prov.", "Guianas", "Huila", "Parana", "trop. South America",
                                     "Cundinamarca", "Juan Fernandez Island", "Malpelo Island", "Alagoas", "subantarctic South America",
                                     "Pernambuco", "Tolima", "Monagas", "Tucuman", "Salta", "Amazonas", "Santa Catarina"},
                    "Northern America": {"Bermuda", "Canada", "Greenland", "Saint Pierre & Miquelon", "USA", "Antarctica",
                                         "French Southern and Antarctic Lands", "Kerguelen Archipelago", 'Alabama', 'Alaska',
                                         'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida',
                                         'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
                                         'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana',
                                         'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina',
                                         'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',
                                         'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia',
                                         'Wisconsin', 'Wyoming', 'District of Columbia', "Baja California", "Kerguelen Island", "St. Pierre et Miquelon",
                                        "Sitka Island", "San Clemente Island", "Santa Cruz Island", "Santa Fe", "Molokai", "Kahoolawe Island",
                                        "Carolina Island", "Hawaii Island", "Oahu", "Maui", "Island Crozet", "Kahoolawe", "Kauai",
                                        "Lanai", "St. Paul Island", "Ontario", "Iles Kerguelen"},
                    "Central Asia": {"Kazakhstan", "Kyrgyzstan", "Tajikistan", "Turkmenistan", "Uzbekistan"},
                    "Eastern Asia": {"China", "Hong Kong", "Japan", "Macao", "Mongolia", "North Korea", "South Korea", "Korea", "Taiwan",
                                    "Hebei", "Tsushima Island", "Ryukyu Island", "Sichuan", "Tibet", "Yakushima Island", "Guangxi",
                                    "Xinjiang", "Volcano Island", "Izu Island", "Yunnan", "Cheju Island", "Bonin Island", "Tokara Island",
                                    "Liaoning", "Shanxi", "Shaanxi", "Tanegashima Island", "Henan", "Nei Mongol", "Hainan"},
                    "Southeastern Asia": {"Borneo", "Brunei", "Cambodia", "Indonesia", "Laos", "Lesser Sunda Islands",
                                          "Malaysia", "Moluccas", "Myanmar", "Philippines", "Singapore", "Thailand",
                                          "Vietnam", "Viet Nam", "Pulau Langkawi", "Selangor", "Malaysia Selangor",
                                         "Langkawi", "Sabah", "Sipora Island", "Luzon", "Lesser Sunda Island", "Pulau Tioman", "Timor",
                                         "Palawan", "Burma", "Sulawesi", "Malay Islands", "Sumatra", "Lombok", "Gunung Leuser",
                                         "Perak", "Negeri Sembilan", "Java", "Tawi-Tawi", "Pahang", "Flores Island", "Sarawak",
                                         "Johor", "Malesia", "Langkawi Island", "Mindanao"},
                    "Southern Asia": {"Afghanistan", "Bangladesh", "Bhutan", "India", "Iran", "Maldives", "Nepal",
                                      "Pakistan", "Sri Lanka", "Manipur", "Pakistani Kashmir", "Andaman Island", "Tripura",
                                     "Uttarakhand", "Nicobar Island", "Jammu & Kashmir", "Himachal Pradesh", "Meghalaya", "Assam State",
                                     "Nagaland"},
                    "Western Asia": {"Armenia", "Azerbaijan", "Bahrain", "Cyprus", "Georgia", "Iraq", "Israel", "Jordan",
                                     "Kuwait", "Lebanon", "Oman", "Qatar", "Saudi Arabia", "Palestine", "Syria", "Turkey",
                                     "United Arab Emirates", "Yemen", "Anatolia", "European Turkey", "Samha Island", "Jordania",
                                    "Socotra"},
                    "Eastern Europe": {"Belarus", "Bulgaria", "Czech Republic", "Hungary", "Poland", "Moldavia", "Moldova",
                                       "Romania", "Russia", "Russian Far East", "European Russia", "Slovakia", "Ukraine", "Siberia",
                                      "Northern Caucasus", "Crimea"},
                    "Northern Europe": {"Guernsey", "Jersey", "Sark", "Denmark", "Estonia", "Faroe Islands", "Finland",
                                        "Iceland", "Ireland", "Isle of Man", "Latvia", "Lithuania", "Norway", "Svalbard",
                                        "Jan Mayen", "Sweden", "United Kingdom", "Great Britain", "Ireland", "Scotland",
                                        "Wales", "Aland Islands", "England", "Scandinavia", "Spitsbergen"},
                    "Southern Europe": {"Albania", "Andorra", "Bosnia & Herzegovina", "Bosnia & Hercegovina", "Croatia", "Gibraltar",
                                        "Greece", "Holy See", "Italy", "Malta", "Montenegro", "Portugal", "San Marino", "Serbia",
                                        "Slovenia", "Spain", "former Yugoslavia", "Yugoslavia", "Macedonia", "Corvo Island",
                                        "Sao Miguel Island", "Tenerife", "Sardinia", "Serbia & Kosovo", "Madeira", "Rhodos",
                                        "Canary Islands", "East Aegaean Island", "Faial", "Azores", "Gran Canaria", "Sao Jorge",
                                       "La Palma Island", "Elba", "Crete", "Dalmatia", "Baleares", "Canary Island", "Sicily", "Karpathos",
                                       "Fuerteventura", "La Gomera", "Terceira"},
                    "Western Europe": {"Austria", "Belgium", "France", "Germany", "Liechtenstein", "Luxembourg", "Monaco",
                                       "Netherlands", "Switzerland", "Corsica", "Holland", "Westfalen", "Luxemburg"},
                    "Australia & New Zealand": {"Australia", "Christmas Island", "Cocos (Keeling) Islands", "Heard Island",
                                               "McDonald Islands", "New Zealand", "Norfolk Island", "New South Wales",
                                               "Lord Howe Island", "Queensland", "Tokelau Island", "Tasmania", "Chatham Island",
                                               "Kermadec Island", "Auckland Island", "Antipodes Island", "Macquarie Island", "Campbell Island",
                                               "Cook Island", "Snares Island"},
                    "Melanesia, Micronesia & Polynesia": {"Fiji", "New Caledonia", "Papua New Guinea", "New Guinea",
                                                          "Solomon Islands", "Vanuatu", "Guam", "Kiribati",
                                                          "Marshall Islands", "Micronesia", "Nauru", "Northern Marianas",
                                                          "Marianas Island", "Southern Marianas", "Palau",
                                                          "United States Minor Outlying Islands", "American Samoa",
                                                          "Western Samoa", "Cook Islands","French Polynesia", "Niue",
                                                          "Pitcairn", "Samoa", "Tokelau", "Tonga", "Tuvalu", "Ua Pou",
                                                          "Wallis & Futuna Islands", "Austral Island", "Gilbert Island", "Peleliu",
                                                          "Palau Island", "Island of Pines", "Oeno Atoll", "Fatu Hiva", "Line Island",
                                                         "Marianas", "Tahuata", "Rotuma Island", "Marquesas", "Wallis Island",
                                                         "Bismarck Island", "Hiva Oa", "Futuna Island", "Morobe District",
                                                         "Society Island", "Marquesas Island", "Rota Island", "Bismarck Arch.", "Tinian",
                                                         "Tuamotu Arch.", "Tahiti", "Polynesia", "Tuvalu Island", "Melanesia", "Nuku Hiva",
                                                         "Wallis and Futuna Island", "Pohnpei", "Solomon Island", "Easter Island", "Saipan"},
                   "Undefined": {"Indian Ocean", "Victoria", "St. Vincent", "Vaups", "Henderson Island", "Chagos Arch.", "Himalaya", "Minanda Island",
                                "Tristan da Cunha", "Boyac", "Pico", "St. Helena", "Gambier Island", "Kuril Island", "North America",
                                "Acre", "origin unknown", "Ile Amsterdam", "Iles Crozet", "Eastern Mediterranean", "Flores", "Channel Island",
                                "Saharan Mts.", "Ascension Island", "Caucasus", "Prince Edward Island", "Alps", "trop. Africa", "Lara",
                                "cultivated origin", "So Tom", "Moen", "Island Amsterdam", "Santander","North Island", "Valle",
                                 "Inaccessible Island", "St. John", "Tristan dAcunha", "cult.", "Distribution unknown", "Hort.",
                                "New Ile Amsterdam", "Pahan", "Channel Islands", "Pacific", "Miranda", "Pitcairn Island", "Choc",
                                "Caldas", "Urals", "trop. America", "Cordoba", "Indochina", "Rivera", "Ascension", "and surrounding islands",
                                "St. Christopher", "trop. Asia", "Thringen", "Middle Island", "Gough Island", "Tabasco", "Nightingale Island",
                                "American Island", "Trujillo", "Falcon", "tropical Asia", "Asia", "Islands Selvagens", "Stoltenhoff Island",
                                "South Georgia Island", "Magdalena", "Aleutes", "Carpathians", "Europe", "remains uncertain",
                                "", "Cult.", "etc.", "]", "from"},
                   "missing": {"missing"}}



Key and value pairs were flipped in geo_dict so that __keys__ are countries/places and __values__ are geopolitical regions.

In [11]:
geo = {}

for g, countrylist in geo_dict.items():
    for country in countrylist:
        geo[country] = g
        
print(geo)

{'Tunisia': 'Northern Africa', 'Libya': 'Northern Africa', 'Malakal': 'Northern Africa', 'Algeria': 'Northern Africa', 'Western Sahara': 'Northern Africa', 'Sudan': 'Northern Africa', 'Egypt': 'Northern Africa', 'Sinai peninsula': 'Northern Africa', 'Morocco': 'Northern Africa', 'Zimbabwe': 'Eastern Africa', 'Mauritius': 'Eastern Africa', 'Rodrigues Island': 'Eastern Africa', 'Tanzania': 'Eastern Africa', 'La Runion': 'Eastern Africa', 'Zambia': 'Eastern Africa', 'United Republic of Tanzania': 'Eastern Africa', 'Djibouti': 'Eastern Africa', 'Mayotte': 'Eastern Africa', 'Kenya': 'Eastern Africa', 'Comoros': 'Eastern Africa', 'South Sudan': 'Eastern Africa', 'Rodrigues': 'Eastern Africa', 'Eritrea': 'Eastern Africa', 'Ethiopia': 'Eastern Africa', 'Somalia': 'Eastern Africa', 'French Southern Territories': 'Eastern Africa', 'Burundi': 'Eastern Africa', 'Pemba Island': 'Eastern Africa', 'Rwanda': 'Eastern Africa', 'Agalega Island': 'Eastern Africa', 'Madagascar': 'Eastern Africa', 'Reunion
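Flipping the dictionary silently keeps only one region for any place listed under several regions. In `geo_dict`, "Guinea" appears under both "Middle Africa" and "Western Africa", and "Georgia" under both "Northern America" and "Western Asia"; with insertion-ordered dictionaries the later region wins. A small sketch for listing such conflicts before flipping is shown below. The helper `find_conflicts` and the reduced sample dictionary are illustrative, not part of the notebook.

```python
from collections import defaultdict

def find_conflicts(region_to_places):
    """Return {place: [regions]} for places listed under more than one region."""
    place_regions = defaultdict(list)
    for region, places in region_to_places.items():
        for place in places:
            place_regions[place].append(region)
    return {p: r for p, r in place_regions.items() if len(r) > 1}

# Reduced, hypothetical sample with the same shape as geo_dict.
sample_geo_dict = {
    "Middle Africa": {"Gabon", "Guinea"},
    "Western Africa": {"Ghana", "Guinea"},
    "Northern America": {"Georgia", "Canada"},
    "Western Asia": {"Georgia", "Armenia"},
}
conflicts = find_conflicts(sample_geo_dict)
print(conflicts)
```

Running `find_conflicts(geo_dict)` on the full dictionary would be one way to review ambiguous entries before relying on the flipped mapping.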

Places from the cleaned location description file were looked up in the geopolitical dictionary. For each species, the unique geopolitical regions were joined into one semicolon-separated string and the matched places into another. Both were stored as values in the final_distribution dictionary.

In [12]:
final_distribution = {}

known_countries = geo.keys()
missing_geos = []
for critter, value in species.items():
    countries = value             # list of places for this taxon
    found_geos = []
    geopolitical = []
    for country in countries:
        c = country.strip()
        if c in known_countries:
            # look up the region only for known places, so unmatched ones cannot raise KeyError
            geopolitical.append(geo[c])
            found_geos.append(c)
        else:
            missing_geos.append(c)
    # print("MISSING VALUES:", set(missing_geos))

    joined_geo = ";".join(list(set(geopolitical)))
    joined_cou = ";".join(found_geos)

    final_distribution[critter] = [joined_geo], [joined_cou]
print(final_distribution)

{'45194626': (['Melanesia, Micronesia & Polynesia'], ['New Guinea']), '45194627': (['South America'], ['Ecuador']), '45194628': (['South America'], ['Bolivia;Ecuador;Peru']), '45194632': (['Undefined;Caribbean;Central America;South America'], ['Mexico;Belize;Guatemala;Honduras;El Salvador;Nicaragua;Costa Rica;Panama;Colombia;Venezuela;Ecuador;Cuba;Jamaica;Hispaniola;Puerto Rico;St. Eustatius;St. Kitts;Montserrat;Guadeloupe;Dominica;Martinique;St. Lucia;St. Vincent;Trinidad;Tobago']), '45194645': (['Southern Asia'], ['India']), '45194655': (['Eastern Africa'], ['Madagascar']), '45194658': (['Eastern Asia;Melanesia, Micronesia & Polynesia;Australia & New Zealand;Southeastern Asia'], ['Australia;Solomon Island;Bonin Island;Volcano Island;New Guinea;Bismarck Arch.;Moluccas;Philippines;Sulawesi;Java;Lesser Sunda Island;Northern Marianas;Samoa;Vanuatu;Fiji']), '45194688': (['Undefined'], ['tropical Asia']), '45194692': (['Southern Asia;Southeastern Asia'], ['Philippines;Borneo;Sulawesi;Lesse
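Because the regions pass through `set`, their order in the joined string can differ between runs. The sketch below shows one possible variant of the per-species lookup that keeps first-seen order and reports unmatched places separately. The function `summarise_places` and the sample mapping are hypothetical and only illustrate the idea.

```python
def summarise_places(places, place_to_region):
    """Return (regions string, matched places string, unmatched list)."""
    regions, matched, unmatched = [], [], []
    for place in places:
        name = place.strip()
        region = place_to_region.get(name)
        if region is None:
            unmatched.append(name)   # keep for later review
            continue
        matched.append(name)
        if region not in regions:    # de-duplicate while keeping first-seen order
            regions.append(region)
    return ";".join(regions), ";".join(matched), unmatched

# Hypothetical place-to-region mapping in the same shape as the flipped dictionary.
sample_geo = {"Bolivia": "South America", "Peru": "South America", "Cuba": "Caribbean"}
result = summarise_places(["Bolivia ", " Peru", " Cuba", " Atlantis"], sample_geo)
print(result)
# prints: ('South America;Caribbean', 'Bolivia;Peru;Cuba', ['Atlantis'])
```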

The dictionary was then loaded into a DataFrame, which was exported to a .csv for use in the main Jupyter notebook, CatalogueLife_Distribution.ipynb.

In [13]:
df = pd.DataFrame.from_dict(final_distribution, orient='index')
df_distribution = df.reset_index()
df_distribution.columns=['taxonID', 'geopolitical_regions','location_distribution']
df_distribution.head()

Unnamed: 0,taxonID,geopolitical_regions,location_distribution
0,45194626,"[Melanesia, Micronesia & Polynesia]",[New Guinea]
1,45194627,[South America],[Ecuador]
2,45194628,[South America],[Bolivia;Ecuador;Peru]
3,45194632,[Undefined;Caribbean;Central America;South Ame...,[Mexico;Belize;Guatemala;Honduras;El Salvador;...
4,45194645,[Southern Asia],[India]


In [14]:
df_distribution.to_csv(outfile)