#### This exploratory work reads a JSON file ("vdbRdf.json", from this project: https://dharc-org.github.io/vespasiano-da-bisticci-letters-de/documentation/downloads.html), analyzes the 'place' column, filters the non-null places, extracts a URI from the nested structure, removes duplicates, and saves the unique places to a CSV file. These steps give a first look at the places data and lay the groundwork for further analysis or processing. Here is a summary of the exploratory work conducted:

1. **Reading the JSON file**:
   - The code begins by reading the contents of the JSON file using the `read_json()` function and assigning it to the variable `data_json`.
   - It then displays information about the structure and contents of the `data_json` object using the `info()` method.

2. **Creating a DataFrame**:
   - The JSON file is read again using pandas' `read_json()` function, creating a DataFrame named `df_data_json`.
   - This allows for more convenient data manipulation and analysis.

3. **Analyzing the 'place' column**:
   - The code focuses on analyzing the column with the key `'http://purl.org/vocab/bio/0.1/place'` within the DataFrame `df_data_json`.
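   - It uses the `describe()` method to summarize the column. Because the column has object dtype, the summary reports the number of non-null entries (`count`), the number of distinct values (`unique`), the most frequent value (`top`), and how often it occurs (`freq`).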

4. **Filtering and extracting non-null places**:
   - The code filters the rows in `df_data_json` where the `'place'` column is not null (not missing) and keeps only that column.
   - The filtered values are assigned to `df_2`, which is a pandas Series rather than a DataFrame.
   - This step ensures that only valid places are considered for further analysis.

5. **Extracting specific information**:
   - The code extracts specific information from the nested structure within `df_2`.
   - It takes the entry at position 2 of `df_2` (`iloc[2]`), selects the first element of the list stored there (`[0]`), and reads the `'@id'` key of that dictionary.
   - This step assumes a specific data structure and extracts a particular piece of information from it.

6. **Removing duplicates**:
   - The code creates a new Series named `df_3_dropped` by removing duplicate values from `df_2`.
   - This ensures that only unique places are retained for further analysis or processing.

7. **Statistical summary of unique places**:
   - The `describe()` method is applied to `df_3_dropped` to summarize the deduplicated values.
   - Since the values are objects rather than numbers, the summary shows `count`, `unique`, `top`, and `freq`; here `count` and `unique` are both 34 and `freq` is 1, which confirms that no duplicates remain.

8. **Saving unique places**:
   - The code saves the contents of `df_3_dropped` into a CSV file named "unique_places.csv" using the `to_csv()` method.
   - This allows for easy storage and further analysis of the unique places extracted from the JSON file.

In [1]:
from pandas import *
import pandas as pd

## reading "vdbRdf.json" from Prof. Tomasi's project
### First I read the JSON file to learn about its columns
### I found that column number 20 is the URI for place

In [6]:
data_json = read_json("vdbRdf.json")
data_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4165 entries, 0 to 4164
Data columns (total 54 columns):
 #   Column                                                                        Non-Null Count  Dtype 
---  ------                                                                        --------------  ----- 
 0   @id                                                                           4165 non-null   object
 1   @type                                                                         4165 non-null   object
 2   http://purl.org/vocab/frbr/core#exemplar                                      362 non-null    object
 3   http://www.w3.org/2000/01/rdf-schema#label                                    3308 non-null   object
 4   http://purl.org/dc/terms/bibliographicCitation                                202 non-null    object
 5   http://purl.org/dc/terms/description                                          3023 non-null   object
 6   http://purl.org/spar/fabio/hasPublicatio

### the place column has many NaN values:

In [7]:
data_json["http://purl.org/vocab/bio/0.1/place"]

0                                                     NaN
1                                                     NaN
2                                                     NaN
3                                                     NaN
4                                                     NaN
                              ...                        
4160                                                  NaN
4161                                                  NaN
4162                                                  NaN
4163                                                  NaN
4164    [{'@id': 'http://vespasianodabisticciletters.u...
Name: http://purl.org/vocab/bio/0.1/place, Length: 4165, dtype: object
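
The NaN values appear because most records in the JSON array simply do not carry the place key. The following is a minimal sketch of that effect and of one way to count missing versus present entries. The three-record sample and the `example.org` URIs are made up for illustration and are not taken from "vdbRdf.json".

```python
import io
import pandas as pd

PLACE = "http://purl.org/vocab/bio/0.1/place"

# Hypothetical sample: only the last record has the place key.
sample = """[
  {"@id": "http://example.org/a"},
  {"@id": "http://example.org/b"},
  {"@id": "http://example.org/c",
   "http://purl.org/vocab/bio/0.1/place": [{"@id": "http://example.org/roma"}]}
]"""

df = pd.read_json(io.StringIO(sample))

# Records without the key get NaN in that column.
missing = df[PLACE].isna().sum()
present = df[PLACE].notna().sum()
print(f"missing: {missing}, present: {present}")
```

On the real data, the same two counts would show how sparse the place column is before any filtering.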

In [9]:
df_data_json = pd.read_json("vdbRdf.json")

### now we see that there are only 34 unique values among the 244 non-null entries

In [7]:
df_data_json['http://purl.org/vocab/bio/0.1/place'].describe(include="all")

count                                                   244
unique                                                   34
top       [{'@id': 'http://vespasianodabisticciletters.u...
freq                                                     81
Name: http://purl.org/vocab/bio/0.1/place, dtype: object
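
To make the four fields of this summary easier to read, here is a small sketch of `describe()` on an object column. It uses plain strings as hypothetical place labels instead of the nested lists found in the real column, so it only illustrates what `count`, `unique`, `top`, and `freq` mean.

```python
import pandas as pd

# Hypothetical place labels standing in for the real column values.
places = pd.Series(["roma", "firenze", "roma", "napoli", "roma"], dtype="object")

summary = places.describe()

# count  = number of non-null entries
# unique = number of distinct values
# top    = most frequent value
# freq   = how many times the top value occurs
print(summary)
```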

In [13]:
# df_3.reset_index(inplace=True, drop=True)

In [18]:
df_data_json.keys()

Index(['@id', '@type', 'http://purl.org/vocab/frbr/core#exemplar',
       'http://www.w3.org/2000/01/rdf-schema#label',
       'http://purl.org/dc/terms/bibliographicCitation',
       'http://purl.org/dc/terms/description',
       'http://purl.org/spar/fabio/hasPublicationYear',
       'http://purl.org/vocab/frbr/core#embodiment',
       'http://purl.org/vocab/frbr/core#partOf',
       'http://purl.org/dc/terms/creator',
       'http://purl.org/dc/terms/references',
       'http://purl.org/dc/terms/source', 'http://purl.org/dc/terms/type',
       'http://purl.org/spar/c4o/hasContent',
       'http://purl.org/dc/terms/isReferencedBy',
       'http://purl.org/vocab/bio/0.1/birth',
       'http://purl.org/vocab/bio/0.1/death',
       'http://purl.org/spar/pro/isHeldBy',
       'http://purl.org/spar/pro/relatesTo',
       'http://purl.org/spar/pro/withRole',
       'http://purl.org/vocab/bio/0.1/place',
       'http://www.ontologydesignpatterns.org/cp/owl/timeindexedsituation.owl#atTime',


In [8]:
df_data_json[df_data_json['http://purl.org/vocab/bio/0.1/place'].notna()]['http://purl.org/vocab/bio/0.1/place']

13      [{'@id': 'http://vespasianodabisticciletters.u...
14      [{'@id': 'http://vespasianodabisticciletters.u...
17      [{'@id': 'http://vespasianodabisticciletters.u...
21      [{'@id': 'http://vespasianodabisticciletters.u...
40      [{'@id': 'http://vespasianodabisticciletters.u...
                              ...                        
4048    [{'@id': 'http://vespasianodabisticciletters.u...
4050    [{'@id': 'http://vespasianodabisticciletters.u...
4085    [{'@id': 'http://vespasianodabisticciletters.u...
4109    [{'@id': 'http://vespasianodabisticciletters.u...
4164    [{'@id': 'http://vespasianodabisticciletters.u...
Name: http://purl.org/vocab/bio/0.1/place, Length: 244, dtype: object

In [9]:
df_2 = df_data_json[df_data_json['http://purl.org/vocab/bio/0.1/place'].notna()]['http://purl.org/vocab/bio/0.1/place']

### reading one place URI as an example:

In [19]:
df_2.iloc[2][0]['@id']

'http://vespasianodabisticciletters.unibo.it/roma'
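
The cell above reads a single URI. A possible way to do the same for every row is sketched below, assuming each cell is a list of dictionaries with an `'@id'` key. The Series `df_2_sample` and its `example.org` URIs are hypothetical stand-ins for `df_2`.

```python
import pandas as pd

# Hypothetical stand-in for df_2: each cell is a list of {'@id': ...} dicts.
df_2_sample = pd.Series(
    [
        [{"@id": "http://example.org/firenze"}],
        [{"@id": "http://example.org/roma"}],
        [{"@id": "http://example.org/roma"}, {"@id": "http://example.org/napoli"}],
    ],
    index=[13, 14, 17],
)

# explode() gives each list element its own row (the index label repeats),
# then map() pulls the '@id' string out of each dictionary.
place_uris = df_2_sample.explode().map(lambda d: d["@id"])
print(place_uris.tolist())
```

Unlike `[0]`, this keeps every place when a cell lists more than one, at the cost of repeated index labels.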

### dropping the duplicates:

In [20]:
df_3_dropped = df_2.drop_duplicates()

In [21]:
df_3_dropped.describe()

count                                                    34
unique                                                   34
top       [{'@id': 'http://vespasianodabisticciletters.u...
freq                                                      1
Name: http://purl.org/vocab/bio/0.1/place, dtype: object
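
Python lists are not hashable, so deduplicating cells that hold lists may behave differently across pandas versions. One alternative, sketched here under the assumption that the `'@id'` strings have already been extracted, is to deduplicate and count the URI strings themselves. The sample URIs and the name `place_uri` are hypothetical.

```python
import pandas as pd

# Hypothetical URI strings, e.g. the result of extracting '@id' from each cell.
place_uris = pd.Series(
    [
        "http://example.org/roma",
        "http://example.org/firenze",
        "http://example.org/roma",
    ],
    name="place_uri",
)

# Keep the first occurrence of each URI and renumber the rows from 0.
unique_uris = place_uris.drop_duplicates().reset_index(drop=True)

# How often each URI occurs in the original data.
counts = place_uris.value_counts()

print(unique_uris.tolist())
print(counts.to_dict())
```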

### creating a csv file for unique places:

In [22]:
df_3_dropped.to_csv('unique_places.csv')
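
As written, `to_csv()` also stores the original row index, and each place is saved as the text form of a list of dictionaries. If a flat file with one URI per line is preferred, a sketch like the following may help. It assumes the URIs are already plain strings; the sample values, the column name `place_uri`, and the temporary output path are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical unique URIs, already extracted as plain strings.
unique_uris = pd.Series(
    ["http://example.org/roma", "http://example.org/firenze"],
    name="place_uri",
)

# Write to a temporary directory so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "unique_places.csv")

# index=False leaves out the row index; header=True writes the column name.
unique_uris.to_csv(out_path, index=False, header=True)

# Read the file back to check that the round trip preserves the values.
round_trip = pd.read_csv(out_path)
print(round_trip)
```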