In [1]:
import pandas as pd

In [2]:
# import observations
obs = pd.read_hdf(
    'clean_data.h5',
    'obs',
)
obs.shape

(1426881, 63)

In [3]:
media = pd.read_csv(
    'fungi/multimedia.txt',
    delimiter='\t',
)
media.shape

(2803568, 15)

# Reduce 'media' columns

In [4]:
media.columns.tolist()

['gbifID',
 'type',
 'format',
 'identifier',
 'references',
 'title',
 'description',
 'source',
 'audience',
 'created',
 'creator',
 'contributor',
 'publisher',
 'license',
 'rightsHolder']

In [5]:
f"Number of observations with media: {len(media['gbifID'].value_counts())}"

'Number of observations with media: 1387575'
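
To put that count in context against `obs`, the two sets of IDs could be compared directly. The sketch below uses tiny invented frames in place of the real data, and it assumes `obs` also has a `gbifID` column, which this notebook has not shown.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are invented for illustration).
obs_demo = pd.DataFrame({"gbifID": [1, 2, 3, 4]})
media_demo = pd.DataFrame({"gbifID": [1, 1, 2, 5]})

# Distinct observations that have at least one media row.
n_with_media = media_demo["gbifID"].nunique()

# Observations with no media, and media rows that match no observation.
obs_without_media = (~obs_demo["gbifID"].isin(media_demo["gbifID"])).sum()
orphan_media_rows = (~media_demo["gbifID"].isin(obs_demo["gbifID"])).sum()

print(n_with_media, obs_without_media, orphan_media_rows)
```

On the real data, the same three numbers would show whether every media row belongs to a cleaned observation.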

In [6]:
media.head()

Unnamed: 0,gbifID,type,format,identifier,references,title,description,source,audience,created,creator,contributor,publisher,license,rightsHolder
0,891021265,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/170720,,,,,2012-08-31,kirsten,,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,kirsten
1,891023450,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,http://www.flickr.com/photos/loarie/8662398374/,,,,,2013-04-18,Don Loarie,,iNaturalist,http://creativecommons.org/licenses/by/4.0/,Don Loarie
2,891023450,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,http://www.flickr.com/photos/loarie/8662397760/,,,,,2013-04-18,Don Loarie,,iNaturalist,http://creativecommons.org/licenses/by/4.0/,Don Loarie
3,891023450,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,http://www.flickr.com/photos/loarie/8661298161/,,,,,2013-04-18,Don Loarie,,iNaturalist,http://creativecommons.org/licenses/by/4.0/,Don Loarie
4,891023450,StillImage,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,http://www.flickr.com/photos/loarie/8677081358/,,,,,2013-04-18,Don Loarie,,iNaturalist,http://creativecommons.org/licenses/by/4.0/,Don Loarie


## Remove columns with all NaN

In [7]:
# Collect the columns in which every value is missing
nan_cols = media.columns[ media.isnull().all() ].tolist()
nan_cols

['title', 'description', 'source', 'audience', 'contributor']

In [8]:
media = media.drop( nan_cols, axis = 1 )
media.columns.tolist()

['gbifID',
 'type',
 'format',
 'identifier',
 'references',
 'created',
 'creator',
 'publisher',
 'license',
 'rightsHolder']
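
As an aside, pandas can find and drop all-missing columns in a single call with `dropna(axis=1, how='all')`. A minimal sketch on an invented frame, not the real `media` data:

```python
import pandas as pd

# Toy frame: 'title' is entirely missing, 'creator' is only partly missing.
demo = pd.DataFrame({
    "gbifID": [1, 2, 3],
    "title": [float("nan"), float("nan"), float("nan")],
    "creator": ["a", None, "c"],
})

# how='all' drops a column only when every value in it is missing.
trimmed = demo.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

The two-step version used above has the advantage of showing which columns were removed.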

## Type

In [9]:
media['type'].value_counts()

StillImage    2803519
Sound               1
Name: type, dtype: int64

In [10]:
# Keep only 'StillImage' rows. This drops the single 'Sound' row
# and any rows where 'type' is missing.
media = media[ media['type'] == 'StillImage' ]
media['type'].value_counts()

StillImage    2803519
Name: type, dtype: int64
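
The counts in cell 9 sum to 2,803,520, which is 48 fewer than the 2,803,568 rows loaded. Those 48 rows presumably have a missing `type`, since `value_counts()` leaves out missing values by default. A small sketch on invented data shows how `dropna=False` would surface them, and that the equality filter removes them as well:

```python
import pandas as pd

# Toy column with one missing value (invented for illustration).
demo = pd.DataFrame({"type": ["StillImage", "StillImage", "Sound", None]})

# Default: missing values are left out of the tally.
default_counts = demo["type"].value_counts()

# dropna=False adds an entry for the missing values.
full_counts = demo["type"].value_counts(dropna=False)

# An equality filter keeps exact matches only, so the missing row goes too.
kept = demo[demo["type"] == "StillImage"]

print(default_counts.sum(), full_counts.sum(), len(kept))
```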

In [11]:
# Drop the now unnecessary 'type' column
media = media.drop( 'type', axis=1 )

## Format

In [12]:
media['format'].value_counts()

image/jpeg     2796151
image/png         7252
image/pjpeg         77
image/gif           39
Name: format, dtype: int64

All of the formats listed above should be usable. GIFs may need to be converted before use, but that should be trivial.

We should keep this column for future use.
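
If the images are downloaded later, the `format` value could decide the file extension and flag which files need conversion. A minimal sketch: the mapping, the conversion rule, and the helper name `describe_format` are all hypothetical choices, not something this notebook defines.

```python
# Hypothetical mapping from the MIME types seen above to file extensions.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/pjpeg": ".jpg",  # assumption: treated as an ordinary JPEG
    "image/png": ".png",
    "image/gif": ".gif",
}


def describe_format(mime_type):
    """Return (extension, needs_conversion) for a value from 'format'."""
    extension = EXTENSIONS[mime_type]
    # Assumption: only GIFs are converted before use.
    return extension, mime_type == "image/gif"


print(describe_format("image/pjpeg"))
print(describe_format("image/gif"))
```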

## Identifier

In [13]:
media['identifier'].value_counts()

https://inaturalist-open-data.s3.amazonaws.com/photos/23611710/original.jpg     5
https://inaturalist-open-data.s3.amazonaws.com/photos/57294691/original.jpeg    5
https://inaturalist-open-data.s3.amazonaws.com/photos/14131733/original.jpeg    4
https://inaturalist-open-data.s3.amazonaws.com/photos/23611704/original.jpg     4
https://inaturalist-open-data.s3.amazonaws.com/photos/14131737/original.jpeg    4
                                                                               ..
https://inaturalist-open-data.s3.amazonaws.com/photos/170591197/original.jpg    1
https://inaturalist-open-data.s3.amazonaws.com/photos/60310042/original.jpeg    1
https://inaturalist-open-data.s3.amazonaws.com/photos/114746908/original.jpg    1
https://inaturalist-open-data.s3.amazonaws.com/photos/115333554/original.jpg    1
https://inaturalist-open-data.s3.amazonaws.com/photos/243535450/original.jpg    1
Name: identifier, Length: 2802842, dtype: int64

Looks like URLs for each image, but why is the same URL used for multiple rows?

Let's look at one of these 'identifier' values and see what rows it is used in and why that might be.

In [14]:
media[ media['identifier'] == 'https://inaturalist-open-data.s3.amazonaws.com/photos/23611710/original.jpg']

Unnamed: 0,gbifID,format,identifier,references,created,creator,publisher,license,rightsHolder
39191,1898835220,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/23611710,2018-08-22T17:32:54Z,Spencer Hardy,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,Spencer Hardy
39197,1898835445,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/23611710,2018-08-22T17:32:54Z,Spencer Hardy,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,Spencer Hardy
1721422,1898835473,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/23611710,2018-08-22T17:32:54Z,Spencer Hardy,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,Spencer Hardy
2281623,1898835509,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/23611710,2018-08-22T17:32:54Z,Spencer Hardy,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,Spencer Hardy
2281625,1898835514,image/jpeg,https://inaturalist-open-data.s3.amazonaws.com...,https://www.inaturalist.org/photos/23611710,2018-08-22T17:32:54Z,Spencer Hardy,iNaturalist,http://creativecommons.org/licenses/by-nc/4.0/,Spencer Hardy


Let's take a look at the image:
![sample](https://inaturalist-open-data.s3.amazonaws.com/photos/23611710/original.jpg)

Oh, it looks like the same image is attached to multiple observations (of different species)! This will make things a bit more complicated. We will need to identify each individual mushroom and locate it in the image, most likely with a pre-trained model provided by an API. Other images will likely contain multiple individual mushrooms of the same species, and those will also need to be identified and separated from each other.

Those are some tasty looking chanterelles...
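
To see how widespread shared images are, the rows could be grouped by `identifier` and the distinct observations counted. A sketch on invented rows (the file names are placeholders, not real URLs):

```python
import pandas as pd

# Toy rows: 'a.jpg' is attached to three observations, 'b.jpg' to one.
demo = pd.DataFrame({
    "gbifID": [10, 11, 12, 13],
    "identifier": ["a.jpg", "a.jpg", "a.jpg", "b.jpg"],
})

# Count distinct observations per image URL.
obs_per_image = demo.groupby("identifier")["gbifID"].nunique()

# Keep only the images that more than one observation points at.
shared = obs_per_image[obs_per_image > 1]

# Rows belonging to those shared images, for manual review.
shared_rows = demo[demo["identifier"].isin(shared.index)]
print(shared.to_dict(), len(shared_rows))
```

Run against `media`, `shared_rows` would list every observation whose photo also appears under another `gbifID`.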

## References

In [15]:
media['references'].value_counts()

https://www.inaturalist.org/photos/23611710     5
https://www.inaturalist.org/photos/57294691     5
https://www.inaturalist.org/photos/23611704     4
https://www.inaturalist.org/photos/14131737     4
https://www.inaturalist.org/photos/75896078     4
                                               ..
https://www.inaturalist.org/photos/115333554    1
https://www.inaturalist.org/photos/115333588    1
https://www.inaturalist.org/photos/115793758    1
https://www.inaturalist.org/photos/115793779    1
https://www.inaturalist.org/photos/243535450    1
Name: references, Length: 2802837, dtype: int64

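These look like the source pages for each image, mostly on iNaturalist's website (the first rows shown earlier also include Flickr links). The mapping to 'identifier' is close to 1-1 but not exact: there are 2,802,842 unique identifiers and 2,802,837 unique references. Either way, the column is redundant for our purposes.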
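
One way to test how close the mapping is to one-to-one, rather than eyeballing it, is to count distinct values on each side. A sketch on invented rows (the real check would have to run on `media` before the column is dropped):

```python
import pandas as pd

# Toy rows: reference 'r2' is shared by two different identifiers.
demo = pd.DataFrame({
    "identifier": ["u1", "u1", "u2", "u3"],
    "references": ["r1", "r1", "r2", "r2"],
})

# Largest number of distinct values on the other side of each mapping.
refs_per_id = demo.groupby("identifier")["references"].nunique().max()
ids_per_ref = demo.groupby("references")["identifier"].nunique().max()

# The mapping is one-to-one only if both maxima are 1.
is_one_to_one = bool(refs_per_id == 1 and ids_per_ref == 1)
print(refs_per_id, ids_per_ref, is_one_to_one)
```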

In [16]:
media = media.drop( 'references', axis=1 )

## Created

In [17]:
media['created'].value_counts()

2016-10-04                   109
2015-08-23                   105
2017-10-13                    92
2021-10-10                    89
2018-09-02                    89
                            ... 
2010-05-01T14:57:55-07:00      1
2010-05-01T15:00:59-07:00      1
2010-05-01T14:58:53-07:00      1
2010-05-01T14:57:36-07:00      1
2022-10-07T12:34Z              1
Name: created, Length: 2195959, dtype: int64

Timestamp of when the media was created. The values mix plain dates with full timestamps in several formats, so they would need to be parsed, and they are probably redundant with the timestamp in the observations dataset. All we care about in terms of timestamps is when the observation was made, which is in the 'observations' dataset.

In [18]:
media = media.drop( 'created', axis=1 )

## Creator

In [19]:
media['creator'].value_counts()

Reiner Richter     44681
Damon Tighe        17817
John Plischke      13530
huafang            12248
Stephen Russell    12238
                   ...  
stana20                1
leila_sdr              1
karensturm10           1
lbrva                  1
laraamorim             1
Name: creator, Length: 142184, dtype: int64

Shows who has contributed the most images. Unnecessary for our purposes.

In [20]:
media = media.drop( 'creator', axis=1 )

## Publisher

In [21]:
media['publisher'].value_counts()

iNaturalist    2803519
Name: publisher, dtype: int64

All are the same publisher, iNaturalist, so we can get rid of this column:

In [22]:
media = media.drop( 'publisher', axis=1 )

## License

In [23]:
media['license'].value_counts()

http://creativecommons.org/licenses/by-nc/4.0/       2322273
http://creativecommons.org/licenses/by/4.0/           248770
http://creativecommons.org/publicdomain/zero/1.0/      88658
http://creativecommons.org/licenses/by-nc-sa/4.0/      77292
http://creativecommons.org/licenses/by-sa/4.0/         43114
http://creativecommons.org/licenses/by-nc-nd/4.0/      22932
http://creativecommons.org/licenses/by-nd/4.0/           480
Name: license, dtype: int64

This column is not necessary for our purposes.

In [24]:
media = media.drop( 'license', axis=1 )

## Rights Holder

In [25]:
media['rightsHolder'].value_counts()

Reiner Richter     44681
Damon Tighe        17817
John Plischke      13530
huafang            12248
Stephen Russell    12238
                   ...  
stana20                1
leila_sdr              1
karensturm10           1
lbrva                  1
laraamorim             1
Name: rightsHolder, Length: 142184, dtype: int64

The values shown mirror 'creator'. Again, this column is not necessary for our purposes.

In [26]:
media = media.drop( 'rightsHolder', axis=1 )

In [27]:
media.columns

Index(['gbifID', 'format', 'identifier'], dtype='object')
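
Since only three columns survive, a rerun could skip the rest at load time. The sketch below uses `usecols` on an in-memory stand-in for `multimedia.txt`. The sample rows are invented, and `type` is still read because the 'StillImage' filter needs it.

```python
import io

import pandas as pd

# In-memory stand-in for the tab-separated multimedia file (invented rows).
sample = io.StringIO(
    "gbifID\ttype\tformat\tidentifier\tcreator\n"
    "1\tStillImage\timage/jpeg\thttps://example.org/1.jpg\talice\n"
    "2\tSound\taudio/mpeg\thttps://example.org/2.mp3\tbob\n"
)

# Read only the columns the cleaning steps actually use.
slim = pd.read_csv(
    sample,
    sep="\t",
    usecols=["gbifID", "type", "format", "identifier"],
)

# Same 'StillImage' filter as above, then drop the now-constant column.
slim = slim[slim["type"] == "StillImage"].drop(columns="type")
print(slim.columns.tolist(), len(slim))
```

Reading fewer columns should also lower memory use on a file with roughly 2.8 million rows, though the column-by-column inspection above is still what justifies the choice.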

# Save cleaned data

In [30]:
media.to_hdf( 'clean_data.h5', key='media' )