## Basic processing of the data file
#### Warning: part of this notebook changes the number of columns and saves the data to a new file. Avoid the "Run All" option. Running some of these cells repeatedly may create duplicate "genus" columns and save them to the "...._with_genus.csv" file.

- count entries by scientific name
- create new column that's just the genus
    - if SN only has one word then keep it, otherwise take the first word  
- count entries by genus

Summary: 
- some columns are entirely null, we will get rid of them at some point
- by scientific name, there are 364 different species 
- by genus name, there are 109 different genera. 13 of these have only one data point, and 39 have fewer than 10
- cross-validation will be a problem for genera with only a few data points: we need to hold some points out for the test set and split the rest further for cross-validation. The classes would be very imbalanced: some genera would have, say, 300 points and some would have 1. This will make some stats 
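
The imbalance described in the last bullet can be made concrete by checking which genera have enough data points to appear in every fold of a stratified split. The following is a minimal sketch, not part of the original analysis: the helper `split_by_min_count`, the threshold `min_count`, and the toy labels are all hypothetical.

```python
from collections import Counter


def split_by_min_count(labels, min_count):
    """Partition class labels into those with at least `min_count` samples and the rest.

    Hypothetical helper: a stratified k-fold split needs at least k samples
    per class, so classes below the threshold must be dropped or merged.
    """
    counts = Counter(labels)
    usable = sorted(g for g, n in counts.items() if n >= min_count)
    too_rare = sorted(g for g, n in counts.items() if n < min_count)
    return usable, too_rare


# Toy labels standing in for the 'genus' column (made-up proportions).
toy_labels = ["Gryllus"] * 6 + ["Oecanthus"] * 3 + ["Xabea"]
usable, too_rare = split_by_min_count(toy_labels, min_count=3)
print(usable)    # ['Gryllus', 'Oecanthus']
print(too_rare)  # ['Xabea']
```

With the real data, the same partition could be computed from the `genus` column once it exists; how to treat the rare genera (drop them, merge them, or keep them for testing only) is left open here.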

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('MLNS_Insects.csv')

In [3]:
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13461 entries, 0 to 13460
Data columns (total 48 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   cat_num                      13461 non-null  float64
 1   format                       13461 non-null  object 
 2   common_name                  11191 non-null  object 
 3   scientific_name              13461 non-null  object 
 4   background_species           128 non-null    object 
 5   recordist                    13461 non-null  object 
 6   date                         13437 non-null  object 
 7   year                         13442 non-null  float64
 8   month                        13440 non-null  float64
 9   day                          13437 non-null  float64
 10  time                         8411 non-null   float64
 11  country                      13434 non-null  object 
 12  country.state.county         13434 non-null  object 
 13  state           

# vvv
### The block below adds a genus column, changes the order of the columns, and saves the dataframe to file. Note that cells that are NA in the original file become empty cells in the saved file.

In [61]:
# create genus column: take the first word in scientific name
df['genus'] = df['scientific_name'].apply(lambda x: x.split()[0])
# rearrange to have genus sit right before scientific name
cols = df.columns.tolist()
cols = cols[:3] + [cols[-1]] + cols[3:-1]
df = df[cols]

df.to_csv('MLSN_Insects_with_genus.csv', index=False)
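
Because the cell above reorders columns by position, running it twice moves the wrong column. One way to make the step safe to repeat is sketched below. This is an assumed variant, not the code used to produce the saved file: the function name `add_genus_column` and the toy dataframe are hypothetical.

```python
import pandas as pd


def add_genus_column(frame):
    """Return a copy of `frame` with one 'genus' column just before 'scientific_name'."""
    # Discard any genus column left over from an earlier run.
    out = frame.drop(columns="genus", errors="ignore")
    # The genus is the first word of the scientific name.
    genus = out["scientific_name"].str.split().str[0]
    out.insert(out.columns.get_loc("scientific_name"), "genus", genus)
    return out


# Toy rows using names that appear in the data set.
toy = pd.DataFrame({
    "cat_num": [1.0, 2.0],
    "scientific_name": ["Gryllus rubens", "Anaxipha"],
})
once = add_genus_column(toy)
twice = add_genus_column(once)  # a second run leaves the layout unchanged
print(list(twice.columns))  # ['cat_num', 'genus', 'scientific_name']
```

Looking up the position of `scientific_name` by name, rather than slicing the column list at fixed indices, is what makes the result independent of how many times the function has been applied.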

# vvv
Counting entries
<br> Re-read the saved file so that the following cells can be run without adding an extra genus column

In [9]:
df = pd.read_csv('MLSN_Insects_with_genus.csv')

In [10]:
# counts data points grouped by scientific name
name_counts = df.groupby('scientific_name').cat_num.count()
name_counts.sort_values()

scientific_name
Gryllus campestris              1
Pictonemobius                   1
Amphiacusta annulipes           1
Gryllotalpa                     1
Petaloptera                     1
                             ... 
Gryllus pennsylvanicus        280
Oecanthus quadripunctatus     301
Gryllus rubens                371
Anaxipha                      419
Gryllus                      2986
Name: cat_num, Length: 364, dtype: int64

In [11]:
# recount number of data points grouped by genus
genus_counts = df.groupby('genus').cat_num.count()
genus_counts.sort_values()

genus
Neocicada             1
Eriolus               1
Ephippitytha          1
Eneoptera             1
Dissosteira           1
                   ... 
Anaxipha            613
Neoconocephalus     661
Cycloptilum         736
Oecanthus          1206
Gryllus            4772
Name: cat_num, Length: 109, dtype: int64

In [59]:
# count rare genera: those with fewer than 10 data points
counts = 0
for i,v in genus_counts.items():
    if v < 10:
        print(i, v)
        counts += 1
print(counts)

Acheta 5
Acrididae 5
Antillicharis 4
Caribophyllum 4
Ceraia 9
Cocconotus 3
Diatrypa 3
Diceroprocta 1
Dissosteira 1
Eneoptera 1
Ephippitytha 1
Eriolus 1
Fidicina 2
Froggattina 8
Gryllacrididae 1
Henicopsaltria 4
Hispanogryllus 3
Idiostatus 7
Ischyra 6
Megatibicen 1
Miogryllus 2
Nemobius 1
Neocicada 1
Oreopedes 4
Paracyrtophyllus 4
Petaloptera 1
Phoebolampta 2
Phrixa 2
Plagiostira 6
Psaltoda 1
Pseudopleminia 5
Romalea 3
Scapsipedus 4
Steiroxys 3
Syntechna 1
Teleogryllus 2
Tibicen 9
Xabea 1
Xenogryllus 9
39
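
The same counts can be obtained without an explicit loop by using a boolean mask on the counts Series. The sketch below uses a small hypothetical Series, `toy_counts`, holding four of the values printed above, rather than the full `genus_counts`.

```python
import pandas as pd

# Four genus counts copied from the outputs above; stands in for genus_counts.
toy_counts = pd.Series({"Gryllus": 4772, "Ceraia": 9, "Acheta": 5, "Xabea": 1})

rare = toy_counts[toy_counts < 10]        # genera with fewer than 10 data points
n_rare = int((toy_counts < 10).sum())     # how many such genera there are
n_single = int((toy_counts == 1).sum())   # genera with exactly one data point

print(rare.index.tolist())  # ['Ceraia', 'Acheta', 'Xabea']
print(n_rare, n_single)     # 3 1
```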


In [4]:
df = pd.read_csv("MLNS_Insects_Fams.csv")
df.groupby('fam_or_subfam').cat_num.count()

fam_or_subfam
Acrididae             17
Cicadidae            269
Conocephalinae      1489
Eneopterinae          51
Gryllacrididae         1
Gryllinae           4965
Gryllotalpidae       203
Hapithinae           643
Listroscelidinae      19
Mogoplistinae        758
Nemobiinae           792
Oecanthinae         1230
Phalangopsidae        19
Phaneropterinae     1063
Pseudophyllinae      271
Tettigoniinae        575
Trigonidiinae        879
Name: cat_num, dtype: int64

# vvv
Collect the files that have no human voice (list created by Robert)

In [47]:
no_voice_list = pd.read_csv('no_voice_files.csv', header=None)
filename_list = [name.split("\\")[-1] for name in no_voice_list[0]]

In [51]:
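import os
import shutil

# raw strings avoid invalid escape sequences such as '\c' in Windows paths
big_folder = r'E:\chirpfiles\actual_chirp_files'
new_folder = r'E:\chirpfiles\no_voice_files'

# copy every listed file from the full collection into the no-voice folder
for name in filename_list:
    src = os.path.join(big_folder, name)
    dst = os.path.join(new_folder, name)
    shutil.copy(src, dst)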
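
If some listed files might be absent from the source folder, the copy loop stops at the first missing one. A more defensive variant is sketched below under stated assumptions: the function `copy_listed_files` is hypothetical, and the demo uses a temporary directory with made-up file names instead of the real `E:` drive folders.

```python
import shutil
import tempfile
from pathlib import Path, PureWindowsPath


def copy_listed_files(listed_paths, src_dir, dst_dir):
    """Copy each listed file from src_dir to dst_dir; return the names not found."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for listed in listed_paths:
        # Keep only the file name of a Windows-style path, on any operating system.
        name = PureWindowsPath(listed).name
        src = src_dir / name
        if src.is_file():
            shutil.copy(src, dst_dir / name)
        else:
            missing.append(name)
    return missing


# Demo with made-up names: 'a.wav' exists in the source folder, 'b.wav' does not.
with tempfile.TemporaryDirectory() as tmp:
    big = Path(tmp) / "actual_chirp_files"
    big.mkdir()
    (big / "a.wav").write_bytes(b"")
    target = Path(tmp) / "no_voice_files"
    missing = copy_listed_files(["C:\\x\\a.wav", "C:\\x\\b.wav"], big, target)
    copied = sorted(p.name for p in target.iterdir())

print(copied)   # ['a.wav']
print(missing)  # ['b.wav']
```

Returning the missing names instead of raising lets the list be checked against the audio folder in one pass.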
