# Data import of polymer data

Data and methodology taken from: "Estimation and Prediction of the Polymers’ Physical Characteristics Using the Machine Learning Models", Polymers 2024, 16(1), 115; https://doi.org/10.3390/polym16010115.

Github repository: https://github.com/catauggie/polymersML/tree/main

In [1]:
#import the pandas library
import pandas as pd

# Use pandas to import the polymer datafile into a dataframe
polyinfo = pd.read_excel('polyinfo homopolymer.xlsx')

In [2]:
#Print out the head of the dataframe to see if it looks ok
polyinfo.head()

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
0,polyethene,P010001,CH2,Physical property,Density,0.9362,0.010564,g/cm3
1,polyethene,P010001,CH2,Physical property,Specific volume,1.073,12.574191,cm3/g
2,polyethene,P010001,CH2,Optical property,Refractive index,1.531,0.00091,
3,polyethene,P010001,CH2,Thermal property,Crystallization kinetics r,79175.0,122911964.285714,nm/s
4,polyethene,P010001,CH2,Thermal property,Crystallization kinetics k,0.0617,0.58551,


In [3]:
# Determine the number of unique polymers in the dataset
polymers = list(polyinfo['polymer_name'].unique())
print(f'Number of polymers in the dataset = {len(polymers):,}')

Number of polymers in the dataset = 18,312


## Slice the data into smaller dataframes
This snippet of code iterates through a list of polymers, filters information from a DataFrame (`polyinfo`), and then appends the filtered information to a list named `poly_list`. Here's a breakdown of what each part of the code does:

1. `poly_list = []`: Initializes `poly_list` as an empty list. This list will store the filtered information for each polymer found in the `polymers` list.

2. `for p in range(len(polymers)):`: This line starts a loop that iterates over the indices of the `polymers` list. The `range(len(polymers))` generates a sequence of numbers from `0` to the length of the `polymers` list minus one, effectively iterating over each index in the `polymers` list.

3. `poly = polyinfo[polyinfo['polymer_name'] == polymers[p]]`: For each iteration of the loop, this line filters the `polyinfo` DataFrame for rows where the value in the 'polymer_name' column matches the current polymer name in the loop (`polymers[p]`). The result of this filtering (which could be one or more rows of data) is assigned to the variable `poly`.

4. `poly_list.append(poly)`: This line appends the filtered DataFrame `poly` to the `poly_list`. After the loop completes, `poly_list` will contain a list of DataFrames, where each DataFrame corresponds to a filtered view of `polyinfo` for each polymer name in the `polymers` list.

In summary, this code filters a large DataFrame (`polyinfo`) for the rows matching each polymer name in a list (`polymers`), and collects these filtered DataFrames into a list (`poly_list`). This separates the data specific to each polymer for use in subsequent operations.

In [4]:
poly_list = []
for p in range(len(polymers)):
    poly = polyinfo[polyinfo['polymer_name'] == polymers[p]]
    poly_list.append(poly)
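
As a side note, the same split can be sketched with `groupby`, which avoids re-scanning the whole DataFrame once per polymer. This is only an illustrative alternative, not the method used in this notebook; `toy` and `toy_poly_list` are hypothetical names, and the toy rows reuse values shown above.

```python
import pandas as pd

# Toy stand-in for `polyinfo` with the same column names (hypothetical subset)
toy = pd.DataFrame({
    'polymer_name': ['polyethene', 'polyethene', 'poly(hex-1-ene)'],
    'property_name': ['Density', 'Refractive index', 'Density'],
    'property_value_median': [0.9362, 1.531, 0.854],
})

# sort=False keeps the groups in order of first appearance,
# which matches the order returned by .unique()
toy_poly_list = [group for _, group in toy.groupby('polymer_name', sort=False)]

print(len(toy_poly_list))
print(toy_poly_list[0])
```

Each element of `toy_poly_list` is a DataFrame holding the rows of one polymer, just like the elements of `poly_list`.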

In [5]:
# Show the dataframe in the list associated with the first entry (polyethene)
poly_list[0]

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
0,polyethene,P010001,CH2,Physical property,Density,0.9362,0.010564,g/cm3
1,polyethene,P010001,CH2,Physical property,Specific volume,1.073,12.574191,cm3/g
2,polyethene,P010001,CH2,Optical property,Refractive index,1.531,0.00091,
3,polyethene,P010001,CH2,Thermal property,Crystallization kinetics r,79175,122911964.285714,nm/s
4,polyethene,P010001,CH2,Thermal property,Crystallization kinetics k,0.0617,0.58551,
...,...,...,...,...,...,...,...,...
88,polyethene,P010001,CH2,Other physical property,Pvt relation pressure,55,42001.909205,MPa
89,polyethene,P010001,CH2,Other physical property,Pvt relation specific volume,1.127,0.0146,cm3/g
90,polyethene,P010001,CH2,Other physical property,Pvt relation temperature,180,4496404.537488,C
91,polyethene,P010001,CH2,Other physical property,Radiation resistance,26,39970.808057,Mrad


Use a method similar to the one we used to split the larger DataFrame to count the number of property rows we have for each polymer in `poly_list`. Append each count to `len_list`.

In [6]:
len_list = []
for pp in range(len(poly_list)):
    len_pp = len(poly_list[pp])
    len_list.append(len_pp)

Use the same iteration methodology to generate a list of polymer names present in `poly_list`

In [7]:
poly_names_list = []
for n in range(len(poly_list)):
    name_n = list(poly_list[n]['polymer_name'])[0]
    poly_names_list.append(name_n)

In [8]:
print(f'{len(poly_names_list):,}')

18,312


Extract a list of the properties available for each polymer in the list `poly_list`.


In [9]:
poly_feats_list = []
for f in range(len(poly_list)):
    feat_n = list(poly_list[f]['property_name'])
    poly_feats_list.append(feat_n)

From these three lists we build a dictionary that aggregates information about the polymers: their names, the number of features (properties), and the names of these features.


* **'polymer_name': poly_names_list:** Associates the key **'polymer_name'** with the list **poly_names_list**, which contains the names of polymers. This list was prepared by iterating over a list of DataFrames (`poly_list`), extracting the name of the polymer from each DataFrame, and storing these names.
* **'Number of features': len_list:** Links the key **'Number of features'** with the list **len_list**, which records the number of features for each polymer. This list was filled by iterating over `poly_list`, determining the length of the DataFrame (or specifically, the number of rows in the **'property_name'** column) for each polymer, and adding these lengths to **len_list**.
* **'Features names': poly_feats_list:** Connects the key **'Features names'** with the list **poly_feats_list**, containing lists of feature names for each polymer. This list of lists was compiled by iterating over `poly_list`, extracting all the feature names (values from the **'property_name'** column) for each polymer, and appending these lists to **poly_feats_list**.

In [10]:
df_poly_feats = {'polymer_name': poly_names_list, 
                 'Number of features': len_list,
                'Features names': poly_feats_list}

This dictionary is structured to be directly convertible into a pandas DataFrame.
Working with a single structured DataFrame (`ddf`, built from the dictionary `df_poly_feats`) rather than with `poly_list` directly makes the data easier to analyse, present, and manipulate.

In [11]:
ddf = pd.DataFrame(df_poly_feats)
ddf.head()

Unnamed: 0,polymer_name,Number of features,Features names
0,polyethene,93,"[Density, Specific volume, Refractive index, C..."
1,poly(prop-1-ene),86,"[Density, Specific volume, Refractive index, C..."
2,poly(but-1-ene),41,"[Density, Specific volume, Refractive index, C..."
3,poly(pent-1-ene),16,"[Density, Specific volume, Glass transition te..."
4,poly(3-methylbut-1-ene),15,"[Density, Specific volume, Glass transition te..."
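
For comparison, a summary table of the same shape can be sketched directly from the long table with `groupby`. This is a hedged alternative under the assumption that the column names match `polyinfo`; `toy`, `grouped`, and `summary` are hypothetical names and the rows are a small subset of the values shown above.

```python
import pandas as pd

# Toy stand-in for `polyinfo` (hypothetical subset)
toy = pd.DataFrame({
    'polymer_name': ['polyethene', 'polyethene', 'poly(hex-1-ene)'],
    'property_name': ['Density', 'Refractive index', 'Density'],
})

# Group the property names by polymer, keeping first-appearance order
grouped = toy.groupby('polymer_name', sort=False)['property_name']

summary = pd.DataFrame({
    'Number of features': grouped.size(),     # rows per polymer
    'Features names': grouped.apply(list),    # property names per polymer
}).reset_index()

print(summary)
```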


#### The advantage of structuring the data in this way:
1. **Structured Representation:**
    * Clarity and Accessibility: A DataFrame provides a tabular structure that is intuitive to work with. It clearly delineates polymer names, the number of features, and the features themselves in separate columns, making data access and manipulation straightforward.
    * Integration of Related Data: By consolidating polymer names, the number of features, and feature names into a single DataFrame, you establish a direct relationship among these elements. This integrated view facilitates analyses that involve multiple aspects of the data simultaneously.
2. **Enhanced Data Analysis Capabilities:**
    * Built-in Functions: pandas DataFrames support a wide array of built-in functions for data analysis, including statistical summaries, group-by operations, and pivot tables, which are not as readily accessible or efficient with a list of DataFrames or lists.
    * Filtering and Selection: Identifying and extracting data based on certain criteria (e.g., polymers with a specific number of features) is more straightforward with DataFrame operations.
3. **Ease of Data Manipulation:**
    * Modification: Adding, removing, or modifying data (such as adding another feature or updating feature names) is more efficient in a DataFrame structure. Changes can be applied across multiple rows or columns with simple commands.
    * Reshaping and Pivoting: Transforming the dataset to fit specific analysis needs (like pivoting or melting data for different views) is facilitated by DataFrame methods.
4. **Improved Data Visualization:**
    * Direct Plotting: pandas DataFrames integrate seamlessly with visualization libraries like matplotlib and seaborn, allowing for direct plotting of data. Visualizing the distribution of features across polymers or comparing the number of features between polymers can be achieved with minimal code.
5. **Data Export and Sharing:**
    * Interoperability: DataFrames can be easily exported to various formats (CSV, Excel, JSON) for sharing or further analysis, offering flexibility not inherently available with a custom list structure like `poly_list`.
6. **Error Reduction:**
    * Consistency: By structuring data into a DataFrame, you ensure a consistent format, which can reduce errors in data handling and analysis. The tabular format enforces a uniform structure, potentially catching inconsistencies in data types or missing values.

In [12]:
# We export this data as a checkpoint and to have it handy when needed
ddf.to_excel("polymers_number_of_features.xlsx", index=True)

In [13]:
# And import it back just because we can and check the first rows 
df = pd.read_excel('polymers_number_of_features.xlsx')
df.head()

Unnamed: 0.1,Unnamed: 0,polymer_name,Number of features,Features names
0,0,polyethene,93,"['Density', 'Specific volume', 'Refractive ind..."
1,1,poly(prop-1-ene),86,"['Density', 'Specific volume', 'Refractive ind..."
2,2,poly(but-1-ene),41,"['Density', 'Specific volume', 'Refractive ind..."
3,3,poly(pent-1-ene),16,"['Density', 'Specific volume', 'Glass transiti..."
4,4,poly(3-methylbut-1-ene),15,"['Density', 'Specific volume', 'Glass transiti..."
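
Note that after the round trip through Excel the `Features names` cells come back as text (the quotes in the output above show this), not as Python lists. A minimal sketch of how such text could be parsed back, assuming the cells contain valid Python list literals; `toy_df` and `parsed` are hypothetical names:

```python
import ast
import pandas as pd

# Toy stand-in for the re-imported frame: each cell is text, not a list
toy_df = pd.DataFrame({
    'Features names': ["['Density', 'Specific volume']", "['Density', 'SMILES']"]
})

print(type(toy_df['Features names'][0]).__name__)   # str

# ast.literal_eval parses a Python literal from its text form
# without executing arbitrary code
parsed = toy_df['Features names'].apply(ast.literal_eval)

print(parsed[0][0])
```

This is why the cells further down index into `ddf` (which still holds real lists) rather than into the re-imported `df`.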


In [14]:
# Do some checking to ensure the df has the same data as in poly_list
df['Features names'][5]

"['Density', 'Specific volume', 'Glass transition temperature', 'Melting temperature', 'Specific heat capacity cp', 'Specific heat capacity cv', 'Gas permeability coefficient p', 'Intrinsic viscosity eta', 'Radius of gyration', 'Softening temperature', 'SMILES']"

Output the sixth DataFrame in `poly_list` (index 5) so that we can check our methods of manipulation below.

In [15]:
poly_list[5]

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
244,poly(hex-1-ene),P010007,C6H12,Physical property,Density,0.854,0.003341,g/cm3
245,poly(hex-1-ene),P010007,C6H12,Physical property,Specific volume,1.182,0.009742,cm3/g
246,poly(hex-1-ene),P010007,C6H12,Thermal property,Glass transition temperature,-48.3,61.255282,C
247,poly(hex-1-ene),P010007,C6H12,Thermal property,Melting temperature,30,7168.266667,C
248,poly(hex-1-ene),P010007,C6H12,Thermal property,Specific heat capacity cp,0.4586,0.017784,cal/(g*C)
249,poly(hex-1-ene),P010007,C6H12,Thermal property,Specific heat capacity cv,0.291,0.0,cal/(g*C)
250,poly(hex-1-ene),P010007,C6H12,Physicochemical property,Gas permeability coefficient p,0.0,0.0,cm3(STP)cm/(cm2*s*Pa)
251,poly(hex-1-ene),P010007,C6H12,Dilute solution property,Intrinsic viscosity eta,0.29,2.13798,dl/g
252,poly(hex-1-ene),P010007,C6H12,Dilute solution property,Radius of gyration,9.83,0.007707,nm
253,poly(hex-1-ene),P010007,C6H12,Heat characteristics,Softening temperature,-36,0.0,C


Access the first feature name in the list stored in the `Features names` column of the sixth row of `ddf` and concatenate `'_cfdf'` to it. The expression returns the concatenated string.

In [16]:
# Check that we can concatenate a suffix onto a feature name stored in the dataframe
ddf['Features names'][5][0] +'_' +'cfdf'

'Density_cfdf'

Access a specific DataFrame within the list of DataFrames (`poly_list`), and then filter that DataFrame based on a condition involving the `property_name` column. Specifically, filter the sixth DataFrame in `poly_list` (since Python uses zero-based indexing, `poly_list[5]` refers to the sixth element) for rows where the value in the `property_name` column matches the fifth feature name in the list of feature names stored in the sixth row of the `ddf` DataFrame's `Features names` column.

Here's a breakdown:

* poly_list[5]: This accesses the sixth DataFrame within poly_list.
* poly_list[5]['property_name']: This selects the property_name column of the sixth DataFrame.
* ddf['Features names'][5][4]: This retrieves the fifth feature name from the list of feature names in the sixth row of ddf's Features names column. *Remember, both indices are zero-based, so this expression points to what would commonly be referred to as the fifth feature name of the sixth polymer.*

The complete expression filters the sixth DataFrame in poly_list to include only those rows where the property_name matches the specified feature name from ddf.

In [17]:
#Check that we can nicely access certain parts of the dataframe
poly_list[5][poly_list[5]['property_name'] == ddf['Features names'][5][4]]

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
248,poly(hex-1-ene),P010007,C6H12,Thermal property,Specific heat capacity cp,0.4586,0.017784,cal/(g*C)
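
The filter above can be unpacked into separate steps. The sketch below uses a three-row toy copy of the poly(hex-1-ene) data shown earlier; `toy_poly`, `wanted`, `mask`, `row`, and `value` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy copy of a few poly(hex-1-ene) rows, keeping their original index labels
toy_poly = pd.DataFrame({
    'property_name': ['Density', 'Specific volume', 'Melting temperature'],
    'property_value_median': [0.854, 1.182, 30.0],
}, index=[244, 245, 247])

wanted = 'Specific volume'

# Step 1: a boolean Series, True where the property name matches
mask = toy_poly['property_name'] == wanted

# Step 2: keep only the rows where the mask is True
row = toy_poly[mask]

# Step 3: select one column of the matching rows and take the first value
value = toy_poly.loc[mask, 'property_value_median'].iloc[0]

print(row)
print(value)
```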


This code block iterates over the feature names listed for the sixth polymer in ddf['Features names'][5] and extracts specific statistics and units for each feature from the corresponding sixth DataFrame in poly_list. It organizes this extracted data into separate lists for median values, variance values, and units of the properties, along with their associated names. 

Here's a detailed explanation of each part:

* Initialization of Lists: Six lists are initialized to store the median values, variance values, and units for the features, along with their corresponding names. These lists will be populated with the extracted data for each feature of the polymer.
* Looping Over Feature Names: The loop iterates over each feature name for the sixth polymer in ddf.
* Extracting Median Values: For each feature, the code filters the sixth DataFrame in `poly_list` to rows where the `property_name` matches the current feature name being processed. It then selects the `property_value_median` column, converts it to a list, and takes the first element. This value is the median value of the property. A corresponding name for this median value is constructed by appending `_value_median` to the feature name and stored in `feat_median_values_name_l`.
* Extracting Variance Values: Similarly, the variance value for the property is extracted by filtering on the same feature name and selecting the `property_value_variance` column. The first element of this list is the variance value of the property. A corresponding name for this variance value is created by appending `_value_variance` to the feature name and stored in `feat_variance_values_name_l`.
* Extracting Units: The unit for the property is extracted by filtering on the feature name and selecting the `property_unit` column. The first element of this list is the unit of the property. A corresponding name for this unit is constructed by appending `_unit` to the feature name and stored in `feat_unity_name_l`.
* Appending Extracted Data to Lists: The extracted median values, variance values, and units are appended to their respective lists (`feat_median_values_l`, `feat_variance_values_l`, `feat_unity_l`). The constructed names for these values are also appended to their respective name lists.

In [18]:
feat_median_values_l = []
feat_variance_values_l = []
feat_unity_l = []

feat_median_values_name_l = []
feat_variance_values_name_l = []
feat_unity_name_l = []

for l in range(len(ddf['Features names'][5])):
    
    feat_median_l = list(poly_list[5][poly_list[5]['property_name'] == ddf['Features names'][5][l]]['property_value_median'])[0]
    feat_median_name_l = ddf['Features names'][5][l] + '_value_median'
    
    feat_variance_l = list(poly_list[5][poly_list[5]['property_name'] == ddf['Features names'][5][l]]['property_value_variance'])[0]
    feat_variance_name_l = ddf['Features names'][5][l] + '_value_variance'
    
    feat_property_unit_l = list(poly_list[5][poly_list[5]['property_name'] == ddf['Features names'][5][l]]['property_unit'])[0]
    feat_unit_name_l = ddf['Features names'][5][l] + '_unit'    
    
    feat_median_values_l.append(feat_median_l)
    feat_variance_values_l.append(feat_variance_l)
    feat_unity_l.append(feat_property_unit_l)
    
    feat_median_values_name_l.append(feat_median_name_l)
    feat_variance_values_name_l.append(feat_variance_name_l)
    feat_unity_name_l.append(feat_unit_name_l)
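
For readers who prefer a more compact form, the median part of this extraction can be sketched with `drop_duplicates` and a dictionary comprehension. This is an illustrative alternative, not the notebook's method; `toy_poly`, `first_rows`, and `median_by_name` are hypothetical names, and the duplicated `Density` row with value 0.9 is invented to show the "first match wins" behaviour.

```python
import pandas as pd

# Toy single-polymer frame; the last row is a made-up duplicate property
toy_poly = pd.DataFrame({
    'property_name': ['Density', 'Specific volume', 'Density'],
    'property_value_median': [0.854, 1.182, 0.9],
})

# Keep the first row per property, mirroring the `[0]` used in the loop
first_rows = toy_poly.drop_duplicates('property_name').set_index('property_name')

# Build suffixed names and pair them with the median values
median_by_name = {
    f'{name}_value_median': value
    for name, value in first_rows['property_value_median'].items()
}

print(median_by_name)
```

The same pattern would apply to the `property_value_variance` and `property_unit` columns.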

Now we can apply this to all polymers in the list

In [19]:
median_all = []
variance_all = []
unity_all = []

median_names_all = []
variance_names_all = []
unity_names_all = []

for s in range(len(poly_list)):
    feat_median_values_l = []
    feat_variance_values_l = []
    feat_unity_l = []

    feat_median_values_name_l = []
    feat_variance_values_name_l = []
    feat_unity_name_l = []

    for l in range(len(ddf['Features names'][s])):

        feat_median_l = list(poly_list[s][poly_list[s]['property_name'] == ddf['Features names'][s][l]]['property_value_median'])[0]
        feat_median_name_l = ddf['Features names'][s][l] + '_value_median'

        feat_variance_l = list(poly_list[s][poly_list[s]['property_name'] == ddf['Features names'][s][l]]['property_value_variance'])[0]
        feat_variance_name_l = ddf['Features names'][s][l] + '_value_variance'

        feat_property_unit_l = list(poly_list[s][poly_list[s]['property_name'] == ddf['Features names'][s][l]]['property_unit'])[0]
        feat_unit_name_l = ddf['Features names'][s][l] + '_unit'   

        feat_median_values_l.append(feat_median_l)
        feat_variance_values_l.append(feat_variance_l)
        feat_unity_l.append(feat_property_unit_l)

        feat_median_values_name_l.append(feat_median_name_l)
        feat_variance_values_name_l.append(feat_variance_name_l)
        feat_unity_name_l.append(feat_unit_name_l)
    
    median_all.append(feat_median_values_l)
    variance_all.append(feat_variance_values_l)
    unity_all.append(feat_unity_l)
    
    median_names_all.append(feat_median_values_name_l)
    variance_names_all.append(feat_variance_values_name_l)
    unity_names_all.append(feat_unity_name_l)


We iterate over each polymer in `poly_list` and, for each polymer, over each of its features as listed in `ddf['Features names']`. For each feature, we extract (rather than calculate) three stored quantities: the median value, the variance, and the unit. These values are then stored in separate lists (`median_all`, `variance_all`, `unity_all`) along with their corresponding names (`median_names_all`, `variance_names_all`, `unity_names_all`).

Here's a step-by-step explanation of what this code does:

* Initialization: For each polymer in poly_list, it initializes six temporary lists to hold the median values, variance values, and units for each property, as well as the names for these categories.
* Nested Loop:
    * The outer loop (`s`) iterates over each polymer in `poly_list`.
    * The inner loop (`l`) iterates over each feature name for the current polymer as specified in `ddf['Features names']`.
* Data Extraction: Within the inner loop, for each feature: It filters the current polymer's DataFrame (`poly_list[s]`) for rows where the `property_name` matches the current feature name (`ddf['Features names'][s][l]`). It extracts the median value, variance, and unit of the property from the filtered DataFrame and also constructs the names for these metrics by appending suffixes (`_value_median`, `_value_variance`, `_unit`) to the feature name.
* Appending Values: The extracted values and their constructed names are appended to the respective temporary lists.
* Aggregating Results: After processing all features of a polymer, the temporary lists are appended to the corresponding "all" lists (`median_all`, `variance_all`, `unity_all`, `median_names_all`, `variance_names_all`, `unity_names_all`), which aggregate the values and names across all polymers.

The next code cell iterates over the indices of `poly_list`, the list of DataFrames. For each polymer, it zips together names and values for medians, variances, and units, then converts these zipped lists into dictionaries. These dictionaries are appended to separate lists (`medians`, `variances`, `units`), one list per property type.

By zipping the names and values together, you make the data more accessible for operations that require understanding the relationship between a feature's name and its numerical value or unit, such as generating reports, performing detailed analyses, or creating visualizations that require labeled data points.

In [20]:
medians = []
variances = []
units = []

for v in range(len(poly_list)):
    median_v = list(zip(median_names_all[v], median_all[v]))  
    variance_v = list(zip(variance_names_all[v], variance_all[v]))
    unity_v =  list(zip(unity_names_all[v], unity_all[v]))
    
    medians.append(dict(median_v))
    variances.append(dict(variance_v))
    units.append(dict(unity_v))

Here’s what each step does:

* Zipping names and values: The zip function is used to pair each feature's name (with a suffix indicating the property type, such as `_value_median`, `_value_variance`, or `_unit`) with its corresponding value (`median`, `variance`, or `unit`) for the current polymer (`v`). This results in a list of tuples for medians, variances, and units, where each tuple contains a name-value pair.
    * `median_v = list(zip(median_names_all[v], median_all[v]))`: Creates a list of tuples for median values and their names. 
    * `variance_v = list(zip(variance_names_all[v], variance_all[v]))`: Creates a list of tuples for variance values and their names.
    * `unity_v = list(zip(unity_names_all[v], unity_all[v]))`: Creates a list of tuples for unit names and their values.

* Converting to dictionaries: Each list of tuples is then converted into a dictionary using `dict()`. This conversion maps each feature's name (as the key) to its corresponding value (as the dictionary value). This is done for median values, variance values, and units, creating three dictionaries for each polymer.
    * `medians.append(dict(median_v))`: Appends a dictionary of median values (and their corresponding feature names) to the medians list.
    * `variances.append(dict(variance_v))`: Appends a dictionary of variance values (and their corresponding feature names) to the variances list.
    * `units.append(dict(unity_v))`: Appends a dictionary of units (and their corresponding feature names) to the units list.

* Appending to lists: Finally, each dictionary is appended to its corresponding list (medians, variances, units), creating a collection of dictionaries for each polymer. Each list (medians, variances, units) will contain as many dictionaries as there are polymers in `poly_list`, with each dictionary representing the median values, variance values, and units of properties for a single polymer.
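
The zip-then-dict step described above can be seen in isolation with two short lists. The values are taken from the poly(hex-1-ene) rows shown earlier; `names`, `values`, `pairs`, `as_dict`, and `direct` are hypothetical names.

```python
names = ['Density_value_median', 'Specific volume_value_median']
values = [0.854, 1.182]

# zip pairs the i-th name with the i-th value
pairs = list(zip(names, values))
print(pairs)

# dict turns the (name, value) tuples into key/value entries
as_dict = dict(pairs)
print(as_dict['Density_value_median'])

# The intermediate list is optional: dict accepts the zip object directly
direct = dict(zip(names, values))
```

One property of this conversion worth keeping in mind: if the same name appears twice, the dictionary keeps only one entry for it.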


In [21]:
# Sticking with our example of the 6th entry in the polymer list we check that we can access the median values for our entry
medians[5]

{'Density_value_median': 0.854,
 'Specific volume_value_median': 1.182,
 'Glass transition temperature_value_median': -48.3,
 'Melting temperature_value_median': 30,
 'Specific heat capacity cp_value_median': 0.4586,
 'Specific heat capacity cv_value_median': 0.291,
 'Gas permeability coefficient p_value_median': 3.7e-09,
 'Intrinsic viscosity eta_value_median': 0.29,
 'Radius of gyration_value_median': 9.83,
 'Softening temperature_value_median': -36,
 'SMILES_value_median': 'CCCCC(C*)*'}

In [22]:
# Output a list of the properties that have a median value associated with them for our polymer at position 6 (index 5)
list(medians[5].keys())

['Density_value_median',
 'Specific volume_value_median',
 'Glass transition temperature_value_median',
 'Melting temperature_value_median',
 'Specific heat capacity cp_value_median',
 'Specific heat capacity cv_value_median',
 'Gas permeability coefficient p_value_median',
 'Intrinsic viscosity eta_value_median',
 'Radius of gyration_value_median',
 'Softening temperature_value_median',
 'SMILES_value_median']

In [23]:
# We can make a list of unique features in polyinfo and extract the 8th one
unique_features = list(set(list(polyinfo['property_name'])))
unique_features[7]

'Deflection temperature under load hdt'
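
One caveat, offered as a hedged side note: the iteration order of a Python `set` of strings is not guaranteed, and by default string hashing is randomized per interpreter session, so the element at index 7 may differ between runs. If a reproducible order matters, two possible sketches are shown below on toy data (`toy`, `in_order`, and `alphabetical` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    'property_name': ['Density', 'SMILES', 'Density', 'Melting temperature']
})

# Option 1: .unique() keeps the order of first appearance
in_order = list(toy['property_name'].unique())

# Option 2: sort the set to get a stable alphabetical order
alphabetical = sorted(set(toy['property_name']))

print(in_order)
print(alphabetical)
```

Either option gives the same column order on every run, which makes the wide tables built later easier to compare.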

This block iterates through the list of unique feature names (`unique_features`) and constructs new lists (`medians_unique`, `variances_unique`, `units_unique`) that contain strings representing the names of these features appended with specific suffixes (`_value_median`, `_value_variance`, and `_unit`, respectively). This process is typically used to prepare for data aggregation, analysis, or organization tasks where you need to differentiate between the median values, variances, and units associated with each unique feature. 

In [24]:
medians_unique = []
variances_unique = []
units_unique = []

for u in range(len(unique_features)):
    values_feat_u = unique_features[u] + '_value_median'
    variances_feat_u = unique_features[u] + '_value_variance'
    units_unique_u = unique_features[u] + '_unit'
    
    medians_unique.append(values_feat_u)
    variances_unique.append(variances_feat_u)
    units_unique.append(units_unique_u)
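
The same three lists can also be written as list comprehensions. A minimal sketch on a two-element toy list (the real `unique_features` is much longer):

```python
# Toy stand-in for `unique_features`
unique_features = ['Density', 'Melting temperature']

# One comprehension per suffix
medians_unique = [f'{name}_value_median' for name in unique_features]
variances_unique = [f'{name}_value_variance' for name in unique_features]
units_unique = [f'{name}_unit' for name in unique_features]

print(medians_unique)
```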

The following code iterates over each polymer name in the ddf DataFrame (`for q in range(len(ddf['polymer_name'])):`) and constructs a list of median values for all unique features for each polymer. 

This is done by:

* Creating an empty list `median_q` for each polymer.
* Iterating through the `medians_unique` list and using the `.get()` method on that polymer's dictionary (`medians[q]`) to fetch the median value for each unique feature. If a feature is not present for a polymer, `None` is appended instead, ensuring that `median_q` reflects all unique features, with values for those that are applicable and `None` for those that aren't.
* Appending the list `median_q` to the `mega_medians` list, which eventually contains a list of median values (or `None`) for each unique feature across all polymers in `ddf`.

In [25]:
mega_medians = []
for q in range(len(ddf['polymer_name'])):
    median_q = []
    for u in range(len(medians_unique)):
        median_qu = medians[q].get(medians_unique[u], None)
        median_q.append(median_qu)
    mega_medians.append(median_q)
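
To see how `.get()` fills the gaps, here is the same alignment on two toy dictionaries, written as a nested list comprehension. The toy contents are assumptions made for illustration only.

```python
# Two toy per-polymer dictionaries with different keys
medians = [
    {'Density_value_median': 0.854},
    {'Melting temperature_value_median': 30},
]
medians_unique = ['Density_value_median', 'Melting temperature_value_median']

# For every polymer, look up every column name; .get returns None when absent
mega_medians = [[d.get(col) for col in medians_unique] for d in medians]

print(mega_medians)
```

Every inner list has the same length and the same column order, which is what allows the lists to be stacked into a table.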

The mega_medians list represents a structured compilation of median values for all unique properties/features across all polymers, aligning with the structure of `ddf`.

Next we create a pandas DataFrame named `mdf` from the list `mega_medians`, with the columns named according to the list `medians_unique`. 

Here's a breakdown of its components and functionality:

* `pd.DataFrame(mega_medians, columns=medians_unique)`: This function call constructs a DataFrame from `mega_medians`, which is a list of lists, where each inner list contains the median values (or `None` for missing values) of all unique features for one polymer. The `columns=medians_unique` argument specifies that the column names of the DataFrame should be taken from the `medians_unique` list, which contains strings representing the names of unique features with `_value_median` appended.
* `mdf`: This is the variable assigned to hold the newly created DataFrame.

The resulting DataFrame `mdf` will have a row for each polymer (as many rows as there are in `ddf['polymer_name']`), and each column will represent a unique property's median value across these polymers. The column names will reflect the unique property names with an added `_value_median` suffix to indicate these columns store median values of the respective properties.

In [26]:
mdf = pd.DataFrame(mega_medians, columns=medians_unique)
mdf

Unnamed: 0,Flexural creep strain_value_median,Dielectric loss tangent_value_median,Tensile creep rupture time_value_median,Dynamic mechanical properties loss modulus_value_median,Pvt relation temperature_value_median,Vicat softening temperature_value_median,Flexural stress strength at break_value_median,Deflection temperature under load hdt_value_median,Thermal decomposition weight loss_value_median,Cohesive energy density_value_median,...,Gas solubility coefficient s_value_median,Thermal decomposition temperature_value_median,Dynamic mechanical properties loss tangent_value_median,Hansen parameter delta-d: dispersion component_value_median,Dynamic viscosity loss tangent_value_median,Volume resistivity_value_median,Water absorption_value_median,Heat of fusion_value_median,Dynamic flexural properties storage modulus_value_median,Dynamic shear properties storage modulus_value_median
0,,0.00045,10.0,0.150,180.0,123.0,0.0198,64.5,10.0,64.600,...,9.400000e-07,435.0,0.127,0.0,,5.450000e+14,0.01000,0.0370,0.934,3.200000e-05
1,,0.00100,20000.0,0.086,95.0,138.0,0.0395,84.0,10.0,73.000,...,6.730000e-06,373.0,0.055,,33.0,4.050000e+14,0.01145,0.0210,1.700,1.000000e-05
2,,,,0.140,240.9,99.0,,,,60.705,...,,,,,,,,0.0140,,3.400000e-07
3,,,,0.220,,,,,,,...,,,,,,,,0.0043,,
4,,,,,,,,155.0,0.0,,...,,452.5,,,,,,0.0590,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18307,,,,,,,,,,,...,,,,,,,,,,
18308,,,,,,,,,,,...,,,,,,,,,,
18309,,,,,,,,,,,...,,,,,,4.150000e+00,,,,
18310,,,,,,,,,,,...,,,,,,,,,,
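
As a hedged aside, a wide table of this kind can also be sketched straight from the long table with `pivot`. This is an alternative under stated assumptions, not the approach taken in this notebook: the result is indexed by polymer name rather than by position, and `toy` and `wide` are hypothetical names. The toy rows reuse values shown earlier.

```python
import pandas as pd

# Toy stand-in for `polyinfo` (hypothetical subset)
toy = pd.DataFrame({
    'polymer_name': ['polyethene', 'polyethene', 'poly(hex-1-ene)'],
    'property_name': ['Density', 'Refractive index', 'Density'],
    'property_value_median': [0.9362, 1.531, 0.854],
})

wide = (
    # keep the first row per (polymer, property), like the `[0]` in the loops
    toy.drop_duplicates(['polymer_name', 'property_name'])
       # one row per polymer, one column per property
       .pivot(index='polymer_name', columns='property_name',
              values='property_value_median')
       # restore first-appearance order of the polymers
       .reindex(toy['polymer_name'].unique())
       # match the column naming used in this notebook
       .add_suffix('_value_median')
)

print(wide)
```

Properties that a polymer does not have show up as `NaN`, just as in `mdf`.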


The code below performs the same operation as described above for medians, but this time for variances. It constructs a list named `mega_variances`, where each element is a list containing the variance values (or `None` for missing values) of all unique features for one polymer in `ddf`.

In [27]:
mega_variances = []
for q in range(len(ddf['polymer_name'])):
    variance_q = []
    for u in range(len(variances_unique)):
        variance_qu = variances[q].get(variances_unique[u], None)
        variance_q.append(variance_qu)
    mega_variances.append(variance_q)

In [28]:
vdf = pd.DataFrame(mega_variances, columns=variances_unique)

In [29]:
vdf

Unnamed: 0,Flexural creep strain_value_variance,Dielectric loss tangent_value_variance,Tensile creep rupture time_value_variance,Dynamic mechanical properties loss modulus_value_variance,Pvt relation temperature_value_variance,Vicat softening temperature_value_variance,Flexural stress strength at break_value_variance,Deflection temperature under load hdt_value_variance,Thermal decomposition weight loss_value_variance,Cohesive energy density_value_variance,...,Gas solubility coefficient s_value_variance,Thermal decomposition temperature_value_variance,Dynamic mechanical properties loss tangent_value_variance,Hansen parameter delta-d: dispersion component_value_variance,Dynamic viscosity loss tangent_value_variance,Volume resistivity_value_variance,Water absorption_value_variance,Heat of fusion_value_variance,Dynamic flexural properties storage modulus_value_variance,Dynamic shear properties storage modulus_value_variance
0,,0.186204,0.0,0.144664,4.496405e+06,235.392655,0.000090,463.861852,646.466214,42.613812,...,6.126189e-09,7598.312090,0.008877,0.0,,3354844969133619982821830547268557602816,0.001996,0.000221,1.138410,3.445363e-01
1,,3.068865,0.0,0.043990,2.330829e+06,432.672712,0.127296,532.929171,466.351475,123.966957,...,2.817766e-06,6815.698995,0.022569,,588.0,1620381465120619970405709793344880640,0.610991,0.000071,100.427099,1.030412e+00
2,,,,0.011743,1.187042e+05,0.000000,,,,472.219751,...,,,,,,,,0.000152,,1.159326e-09
3,,,,0.000000,,,,,,,...,,,,,,,,0.000000,,
4,,,,,,,,0.000000,0.000000,,...,,0.500000,,,,,,0.000000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18307,,,,,,,,,,,...,,,,,,,,,,
18308,,,,,,,,,,,...,,,,,,,,,,
18309,,,,,,,,,,,...,,,,,,1514753521.55152,,,,
18310,,,,,,,,,,,...,,,,,,,,,,


Now we will concatenate three pandas DataFrames (`ddf`, `mdf`, and `vdf`) along their columns, resulting in a single DataFrame named `united`. 

Here's how it works and what it accomplishes:

* `pd.concat([ddf, mdf, vdf], axis=1)`: This function call uses `pd.concat` to concatenate the given list of DataFrames along `axis=1`, which means the concatenation happens horizontally, adding the columns of `mdf` and `vdf` to those of `ddf`. This results in a wider DataFrame that combines the information from all three source DataFrames into a single structure.
    * `ddf`: Contains basic information about each polymer, such as names or other identifiers.
    * `mdf`: Contains median values for various properties of the polymers, with columns named for each property followed by `_value_median`.
    * `vdf`: Holds variance values for the same properties, with column names consisting of each property followed by `_value_variance`.

**`united`:** The resulting DataFrame is assigned to this variable. `united` has the combined columns from `ddf`, `mdf`, and `vdf`. Because `ddf` contains general information about each polymer, and `mdf` and `vdf` add detailed statistical metrics (medians and variances, respectively) for each property, `united` offers a comprehensive view of all these aspects in one table.

In [30]:
united = pd.concat([ddf, mdf, vdf], axis=1)
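
One detail worth keeping in mind is that `pd.concat(..., axis=1)` matches rows by index label rather than by position. The sketch below uses small invented frames in place of `ddf`, `mdf`, and `vdf` to show both the aligned case and what happens when the indices differ.

```python
import pandas as pd

# Toy stand-ins for ddf, mdf and vdf (values are invented for illustration).
ddf = pd.DataFrame({"polymer_name": ["polyethene", "polypropene"]})
mdf = pd.DataFrame({"Density_value_median": [0.94, 0.90]})
vdf = pd.DataFrame({"Density_value_variance": [0.01, 0.02]})

# All three frames share the default index 0..1, so rows line up one to one.
united = pd.concat([ddf, mdf, vdf], axis=1)
print(united.columns.tolist())

# A frame with a different index is aligned by label: unmatched rows are padded with NaN.
shifted = mdf.copy()
shifted.index = [1, 2]
misaligned = pd.concat([ddf, shifted], axis=1)
print(misaligned.shape)
```

In the notebook the three frames are built in the same polymer order with default indices, so the first case applies.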

Earlier we saw something odd: SMILES appears among both the median and the variance columns, which is counterintuitive because a SMILES string is not a numeric property. So we look at our data to check this. The code below uses a Python list comprehension to filter and extract the column names of a pandas DataFrame (`united`) that contain the substring 'SMILES'.

Here's a breakdown of its components:

* `smiles =`: Assigns the result of the list comprehension to a variable named smiles.
* `[f for f in list(united.columns)`: This is the list comprehension itself. It iterates over each column name in the DataFrame united (accessed via `united.columns`) and converts the column names to a list with `list(united.columns)`.
* `if 'SMILES' in f`: This is the condition within the list comprehension that filters each column name (`f`) by checking if the substring 'SMILES' is present in it.

The result is a list of column names from the DataFrame united that include 'SMILES' in their names. This identifies columns related to the Simplified Molecular Input Line Entry System (SMILES), a notation that allows the representation of chemical species' structures using ASCII strings.

In [31]:
smiles = [f for f in list(united.columns) if 'SMILES' in f]
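
For comparison, pandas has a built-in way to select columns by substring, `DataFrame.filter(like=...)`. The sketch below runs both approaches on a one-row toy frame (invented values) to show that they pick the same columns.

```python
import pandas as pd

# Toy stand-in for `united` with a few representative columns.
united = pd.DataFrame({
    "polymer_name": ["polyethene"],
    "SMILES_value_median": ["*C*"],
    "SMILES_value_variance": [None],
    "Density_value_median": [0.94],
})

# List comprehension, as in the notebook.
smiles = [f for f in united.columns if "SMILES" in f]

# Equivalent selection using DataFrame.filter.
smiles_alt = united.filter(like="SMILES").columns.tolist()

print(smiles)
print(smiles_alt)
```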

In [32]:
# Now output the result of the list comprehension to see what SMILES columns we have
united[smiles]

Unnamed: 0,SMILES_value_median,SMILES_value_variance
0,*C*,
1,*CC(C)*,
2,*CC(CC)*,
3,CCCC(C*)*,
4,*CC(C(C)C)*,
...,...,...
18307,*c1c(c2ccccc2)c(c2ccccc2)c(c2c1cc(cc2)C1(c2ccc...,
18308,[Na]OS(=O)(=O)C(C(OC(C(OC(C(F)(F)*)(F)*)(F)F)(...,
18309,*c1ccc(c2c1cccc2)*,
18310,*C=C([As](c1ccccc1)*)c1ccc(cc1)C=C**1=CC=CC=C1,


Now we see that we have odd column names and an empty SMILES variance column. So let's remove the column named `SMILES_value_variance` from the DataFrame `united`.

Here's a detailed explanation of each part of the command:

* `united.drop([...], axis=1, inplace=True)`: This method is used to delete columns or rows from a DataFrame. The key parameters used here are:
    * The first argument is a list of labels specifying the names of the columns or indices of the rows you want to drop. In this case, `['SMILES_value_variance']` is the column to be removed.
    * `axis=1` specifies that the operation should be performed on columns. `axis=0` would mean rows.
    * `inplace=True` means that the change should be applied directly to united without the need to assign the result to a new DataFrame. The DataFrame `united` is modified in place, effectively removing the specified column.


In [33]:
united.drop(['SMILES_value_variance'], axis=1, inplace=True)

Now we need to deal with renaming the last remaining SMILES column. The next line of code changes the name of a column in the DataFrame `united` from `SMILES_value_median` to `SMILES`.

Here's a breakdown of how the command works:

* `united.rename(columns={...}, inplace=True)`: This method is used to rename the labels of a DataFrame.
    * The `columns` parameter is a dictionary where keys are the existing column names and values are the new names you want to assign. In this case, the column currently named `SMILES_value_median` is being renamed to `SMILES`.
    * The `inplace=True` argument specifies that this renaming should occur in place, meaning the original DataFrame united is directly modified, and you don't need to create a new DataFrame to see the changes.


In [34]:
united.rename(columns={'SMILES_value_median': 'SMILES'}, inplace=True)
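
The drop and rename steps can also be written without `inplace=True`, by chaining the two calls and assigning the result. This is only a sketch of an alternative style on a toy frame. The name `united_clean` is introduced here for illustration and is not used in the notebook.

```python
import pandas as pd

# Toy stand-in for `united` before the clean-up.
united = pd.DataFrame({
    "polymer_name": ["polyethene", "polypropene"],
    "SMILES_value_median": ["*C*", "*CC(C)*"],
    "SMILES_value_variance": [None, None],
})

# Each call returns a new DataFrame, so the original `united` is left untouched.
united_clean = (
    united.drop(columns=["SMILES_value_variance"])
    .rename(columns={"SMILES_value_median": "SMILES"})
)

print(united_clean.columns.tolist())
```

Keeping the original frame unchanged makes it easier to re-run a cell without hitting a `KeyError` for a column that was already dropped.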

Next we output our reshaped dataset as an Excel file and a CSV file.

In [36]:
united.to_excel('resulting_dataset.xlsx')

In [37]:
united.to_csv('resulting_dataset.csv')
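
By default `to_csv` also writes the DataFrame index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows the difference made by `index=False`, writing to in-memory buffers instead of files so that nothing is created on disk. The toy frame is invented.

```python
import io
import pandas as pd

united = pd.DataFrame({
    "polymer_name": ["polyethene", "polypropene"],
    "SMILES": ["*C*", "*CC(C)*"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
united.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
united.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```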

Now we need to convert chemical structures represented as SMILES (Simplified Molecular Input Line Entry System) strings into numerical vectors using RDKit, a widely used open-source cheminformatics library. The conversion generates molecular fingerprints, specifically Morgan fingerprints, which encode molecular structure information into a fixed-size bit vector.

Here's a detailed breakdown:

* Import necessary libraries:
`from rdkit import Chem` and `from rdkit.Chem import AllChem` import the RDKit modules needed for handling chemical information and generating fingerprints.

* Define the `vectorize_smiles` function:
This function takes a single SMILES string as input.
`mol = Chem.MolFromSmiles(smiles)` converts the SMILES string into an RDKit molecule object. If the SMILES string is invalid and cannot be converted, `mol` will be `None`.
    * If `mol` is not `None`, the function uses `AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)` to calculate the Morgan fingerprint of the molecule, specifying a radius of 2 and a bit vector size of 1024. The fingerprint is then converted to a list of integers (0s and 1s) representing the presence or absence of certain molecular features.
    * If the SMILES string is invalid, the function returns a list of 1024 zeros, representing an "empty" fingerprint.

* Apply the function to the 'SMILES' column of the DataFrame united:
`united['SMILES'][0:498].apply(vectorize_smiles)` applies the vectorize_smiles function to each SMILES string in the first 498 rows of the 'SMILES' column of the DataFrame united. The result is a Series where each entry is the list of bits (molecular fingerprint) corresponding to the molecule represented by the SMILES string in that row.

Only the first 498 rows are processed at this point: as shown further down, extending the slice to include row 498 raises an error.

**Storing the result:**
The code snippet ends with storing the Series of molecular fingerprints in united_vector. Each element in this Series is a list of 1024 integers (0s or 1s), representing the fingerprint of the corresponding molecule.

In [51]:
from rdkit import Chem
from rdkit.Chem import AllChem

# Assuming you have a DataFrame with a 'SMILES' column
# Create a function to vectorize SMILES
def vectorize_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        # Calculate molecular fingerprint
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
        return list(fp)
    else:
        return [0] * 1024  # Return a zero vector for invalid SMILES

# Assuming 'united' is your DataFrame with a 'SMILES' column
united_vector = united['SMILES'][0:498].apply(vectorize_smiles)

# united_vector is a Series holding the molecular fingerprint for each SMILES

If we run to 499 then we get an error... What is going on? 

In [52]:
# First we print out the united_vector to ensure this is all ok.
united_vector

0      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1      [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2      [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3      [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4      [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
                             ...                        
493    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
494    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
495    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
496    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
497    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, ...
Name: SMILES, Length: 498, dtype: object

In [56]:
# No error so we try and calculate the fingerprints for the next few lines after 498
united_vector = united['SMILES'][499:501].apply(vectorize_smiles)

In [58]:
# No error, so let's print the problem row 498
united['SMILES'][498]

In [60]:
# Bingo: no output, so our function is erroring out on a blank entry. Let's modify the function definition and try again

The modified code block now includes a check for None values in the vectorize_smiles function, making it more robust by handling cases where the 'SMILES' column might contain `None` values. This ensures that any `None` value results in a zero vector, consistent with the handling of invalid SMILES strings.
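
Before patching the function, it can help to list every row with a missing SMILES value rather than finding them one at a time. The sketch below is one way to do that on a toy Series. The helper name `is_missing_smiles` is hypothetical, and treating `NaN` and blank strings as missing is an assumption that goes beyond the `None` check used in the notebook (an `is None` test alone does not catch `NaN`).

```python
import pandas as pd

def is_missing_smiles(value):
    """Return True for None, NaN, or a string that is empty or whitespace only."""
    if isinstance(value, str):
        return value.strip() == ""
    return bool(pd.isna(value))

# Toy stand-in for united['SMILES'] with several kinds of missing entries.
smiles = pd.Series(["*C*", None, float("nan"), "  ", "*CC(C)*"])

missing_mask = smiles.apply(is_missing_smiles)
missing_rows = smiles.index[missing_mask].tolist()
print(missing_rows)
```

The same helper could replace the `smiles is None` test if the column turns out to contain `NaN` or blank strings as well.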

In [61]:
# Create a function to vectorize SMILES
def vectorize_smiles(smiles):
    if smiles is None:
        return [0] * 1024  # New: return a zero vector for None (previously this case raised an error)
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        # Calculate molecular fingerprint
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
        return list(fp)
    else:
        return [0] * 1024  # Return a zero vector for invalid SMILES

# Assuming 'united' is your DataFrame with a 'SMILES' column
united['vector'] = united['SMILES'].apply(vectorize_smiles)

[20:18:58] Explicit valence for atom # 8 C, 6, is greater than permitted
[20:18:58] Explicit valence for atom # 0 B, 5, is greater than permitted
[20:18:58] Explicit valence for atom # 0 B, 5, is greater than permitted
[20:18:58] Explicit valence for atom # 0 B, 5, is greater than permitted
[20:19:02] Explicit valence for atom # 8 C, 5, is greater than permitted
[20:19:12] Explicit valence for atom # 3 C, 6, is greater than permitted
[20:19:18] Explicit valence for atom # 40 C, 5, is greater than permitted
[20:19:20] Explicit valence for atom # 2 C, 5, is greater than permitted
[20:19:20] Explicit valence for atom # 41 C, 5, is greater than permitted
[20:19:20] Explicit valence for atom # 16 C, 6, is greater than permitted
[20:19:20] Explicit valence for atom # 16 C, 6, is greater than permitted
[20:19:23] Explicit valence for atom # 8 C, 5, is greater than permitted
[20:19:23] Explicit valence for atom # 0 B, 5, is greater than permitted
[20:19:23] Explicit valence for atom # 0 B, 5, 
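
The `Explicit valence ... is greater than permitted` lines are messages from RDKit's SMILES parser. For such inputs `Chem.MolFromSmiles` returns `None`, so `vectorize_smiles` takes its zero-vector branch. The sketch below reproduces this on a deliberately invalid SMILES (a carbon with five bonds, chosen for illustration and not taken from the dataset) and collects the strings that failed to parse.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parser messages for this small demonstration.
RDLogger.DisableLog("rdApp.*")

# Two polymer-style SMILES from the dataset and one invalid string with a five-bonded carbon.
examples = ["*C*", "*CC(C)*", "C(C)(C)(C)(C)C"]

parsed = [Chem.MolFromSmiles(s) for s in examples]

# Collect the inputs that RDKit could not turn into a molecule.
invalid = [s for s, mol in zip(examples, parsed) if mol is None]
print(invalid)
```

Counting or saving the entries in `invalid` gives a record of which polymers ended up with an all-zero fingerprint.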

In [62]:
united.to_csv('resulting_dataset.csv')

In [66]:
cats = polyinfo[['property_category', 'property_name']]
cats = cats.drop_duplicates()

In [69]:
cats.to_csv('categories_feats.csv')

In [70]:
cats

Unnamed: 0,property_category,property_name
0,Physical property,Density
1,Physical property,Specific volume
2,Optical property,Refractive index
3,Thermal property,Crystallization kinetics r
4,Thermal property,Crystallization kinetics k
...,...,...
357,Dilute solution property,Diffusion coefficient
609,Dilute solution property,Sedimentation coefficient
3205,Creep characteristics,Flexural creep strain
4777,Creep characteristics,Tensile creep rupture strength


Now we will create a dictionary where each key is a unique value from the `property_category` column of a DataFrame named `cats`, and each corresponding value is a list of `property_name` values associated with that category. 

Here's a breakdown of how this code works:

* `cats.groupby('property_category')`: This part groups the DataFrame cats by the values in the `property_category` column. It organizes the DataFrame such that all rows sharing the same `property_category` value are grouped together, facilitating operations that are specific to each category.
* `['property_name'].apply(list)`: After grouping, this code selects the `property_name` column for each group and applies the list function. The `apply(list)` operation converts the `property_name` values for each category into a list. This means for every unique `property_category`, you get a list of `property_name` values that fall into that category.
* `.to_dict()`: Finally, the `.to_dict()` method converts the Series of lists into a dictionary. The keys of this dictionary are the unique values from the `property_category` column, and the values are the lists of `property_name` values associated with each key.

The resulting category_dict is a dictionary that maps each property category to a list of property names within that category. This can be extremely useful for organizing, summarizing, or navigating your dataset based on categories and their associated properties.

In [71]:
# Group the DataFrame by 'property_category' and aggregate 'property_name' as a list
category_dict = cats.groupby('property_category')['property_name'].apply(list).to_dict()

# category_dict will contain the desired dictionary
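
To make the shape of the result concrete, here is the same `groupby` chain applied to three rows taken from the `cats` table above.

```python
import pandas as pd

# A three-row excerpt of the category/property pairs shown above.
cats = pd.DataFrame({
    "property_category": ["Physical property", "Physical property", "Optical property"],
    "property_name": ["Density", "Specific volume", "Refractive index"],
})

# Group by category, collect the property names of each group into a list,
# then turn the resulting Series into a plain dictionary.
category_dict = cats.groupby("property_category")["property_name"].apply(list).to_dict()

print(category_dict["Physical property"])
print(category_dict["Optical property"])
```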

Now we can save a Python dictionary (`category_dict`) to a file in JSON format. 

Here's a step-by-step breakdown:

* Import the json module: The json module is part of Python's standard library and provides a simple way to encode and decode JSON data. Here, it's used for converting the Python dictionary into a JSON-formatted string and writing it to a file.
* Specify the file path: The variable `file_path` is assigned the value `category_dict.json`, indicating the name (and possibly the location) of the file where you want to save the dictionary. If only the filename is provided (as in this case), the file will be created in the current working directory of the script.
* Open the file in write mode: The `with open(file_path, 'w') as file:` syntax opens the file for writing (`w` mode), creating it if it doesn't exist, and binds the open file to the variable `file` for the duration of the `with` block. The `with` statement ensures proper acquisition and release of resources; the file is automatically closed when the block is exited, even if an error occurs.

* Write the dictionary to the file in JSON format: `json.dump(category_dict, file)` serializes `category_dict` as a JSON-formatted stream to `file`. This means the dictionary is converted into a JSON string and written to the file specified by `file_path`.

In [72]:
import json


file_path = 'category_dict.json'

# Write the dictionary to a JSON file
with open(file_path, 'w') as file:
    json.dump(category_dict, file)
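
A quick way to check the saved file is to load it back with `json.load` and compare it with the original dictionary. The sketch below does this in a temporary directory so that it does not overwrite `category_dict.json`. The small dictionary and the `indent=2` option are illustrative choices and are not part of the notebook.

```python
import json
import os
import tempfile

# A small excerpt standing in for the full category_dict.
category_dict = {
    "Physical property": ["Density", "Specific volume"],
    "Optical property": ["Refractive index"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "category_dict.json")

    # Write the dictionary; indent=2 only makes the file easier to read.
    with open(file_path, "w") as file:
        json.dump(category_dict, file, indent=2)

    # Read it back to confirm that the round trip preserves the content.
    with open(file_path) as file:
        restored = json.load(file)

print(restored == category_dict)
```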

In [74]:
physics = united[['vector', 'Thermal conductivity_value_median', 'Specific volume_value_median']]

In [75]:
filtered_data = physics[(physics['Thermal conductivity_value_median'].notna()) & (physics['Specific volume_value_median'].notna())]

In [77]:
data = physics[(physics['Specific volume_value_median'].notna())][['vector', 'Specific volume_value_median']]
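
The two `notna()` filters above can also be expressed with `DataFrame.dropna(subset=...)`, which some readers find easier to scan. The sketch below compares both forms on a three-row toy frame with invented values and short two-bit vectors in place of the 1024-bit fingerprints.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `physics`.
physics = pd.DataFrame({
    "vector": [[0, 1], [1, 0], [1, 1]],
    "Thermal conductivity_value_median": [0.4, np.nan, np.nan],
    "Specific volume_value_median": [1.07, 1.10, np.nan],
})

# Mask form, as in the notebook: keep rows where both properties are present.
filtered_mask = physics[
    physics["Thermal conductivity_value_median"].notna()
    & physics["Specific volume_value_median"].notna()
]

# dropna form: drop rows that are missing a value in any of the listed columns.
filtered_data = physics.dropna(
    subset=["Thermal conductivity_value_median", "Specific volume_value_median"]
)

# Rows with a specific volume, keeping only the fingerprint and the target column.
data = physics.dropna(subset=["Specific volume_value_median"])[
    ["vector", "Specific volume_value_median"]
]

print(len(filtered_data), len(data))
```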

In [81]:
data

Unnamed: 0,vector,Specific volume_value_median
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.07300
1,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.10600
2,"[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.10000
3,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.10000
4,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.10000
...,...,...
18219,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.60459
18220,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.60801
18221,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.56857
18222,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...",0.54969
