In [4]:
import pandas as pd
import numpy as np
import swco
import feather
import xlrd

## K-mers

A k-mer is a part of a string of longitude k. This substrings are created consequently. So, all the k-mers of a string are all the consecutive substrings contained in a string.
The aim of this script is to generate a dictionary with all the possible k-mers we found among all the species. The dictionary have the possible k-mers as keys and the values will be the Specie and the Scaffold where this k-mers have been found in the Specieal string.

In [5]:
# To load all of them at once and afterwards accessing one by one, this might be an option.
#https://towardsdatascience.com/a-simple-trick-to-load-multiple-excel-worksheets-in-pandas-3fae4124345b
# Define filepath
filepath = '../Data/Raw/Tables_Filtered_IK_format.xlsx'

# Load Excel file using Pandas with `sheet_name=None`
df_dict = pd.read_excel(filepath, sheet_name=None)

# Preview
#df_dict

# Get a specific one
#human = df_dict.get('Human')

# aprox 3 min 40 secs

In [6]:
# Data cleaning for each specie
df_species = pd.DataFrame()

species = df_dict.keys()

for s in species:
    aux = df_dict.get(s)
    aux['Specie'] = s
    df_species = pd.concat([df_species, aux])

In [7]:
df_species = swco.preprocessing(df_species)
df_species.astype(str).to_feather('../Data/Intermediate/annotation.feather')

In [None]:
# When creating the k-mers, duplicates do not matter, so we can remove them
df_species = swco.cons_duplicates_kmers(df_species)

Create and save a list of the species compared with their number of genes

In [10]:
df_species.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 389222 entries, 0 to 389221
Data columns (total 19 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Index               389222 non-null  int32  
 1   #Replicon Name      389222 non-null  object 
 2   Replicon Accession  389222 non-null  object 
 3   Start               389222 non-null  int64  
 4   Stop                389222 non-null  int64  
 5   Strand              389222 non-null  object 
 6   GeneID              367765 non-null  object 
 7   Locus               389222 non-null  object 
 8   Protein product     389222 non-null  object 
 9   Length              389222 non-null  int64  
 10  Protein name        389221 non-null  object 
 11  Unnamed: 10         25281 non-null   float64
 12  Specie              389222 non-null  object 
 13  Locus tag           47639 non-null   object 
 14  Geneid              21457 non-null   float64
 15  Gene_non_or         389222 non-nul

In [16]:
# Number of genes per specie
species = (df_species[['Specie', 'Gene_non_or']]
            .groupby('Specie', as_index=False)
            .count())
species.columns = ['Specie', 'Number of genes']

# Save them
species.to_csv('../Data/Intermediate/species.csv', index=False)
species

Unnamed: 0,Specie,Number of genes
0,Aadvark,14841
1,Alligator M,13596
2,Alligator S,13140
3,Anolis,12650
4,Chelonia,12293
5,Chrysemys,11218
6,Croco,12138
7,Danio,25221
8,Devil,14059
9,Dog,60


In [6]:
# Takes 1 hour or so untill 200
# 1-mers
df_genes = df_species[['Gene_non_or', 'Specie_Scaffold']]

(df_genes[['Specie_Scaffold', 'Gene_non_or']]
                                    .groupby('Gene_non_or', as_index = False)
                                    .agg(Specie_Scaffold=('Specie_Scaffold', list), Scaffolds=('Gene_non_or', 'count'))
                                    .to_csv('../Data/Intermediate/k_mers/Scaffold/1_mer.csv', index=False))

# 2-mers
df_genes.loc[df_genes['Specie_Scaffold'].shift(-1) == df_genes['Specie_Scaffold'], '2_mers'] = df_genes['Gene_non_or'] + '_' + df_genes['Gene_non_or'].shift(-1)

k = 2
(df_genes[['Specie_Scaffold', str(k) + '_mers']]
                                    .groupby(str(k) + '_mers', as_index = False)
                                    .agg(Specie_Scaffold=('Specie_Scaffold', list), Scaffolds=(str(k) + '_mers', 'count'))
                                    .to_csv('../Data/Intermediate/k_mers/Scaffold/' + str(k) + '_mers.csv', index=False))

# 3-mers onwards
for k in range(3, 250):
    df_genes.loc[df_genes['Specie_Scaffold'].shift(-(k-1)) == df_genes['Specie_Scaffold'], str(k) + '_mers'] = df_genes[str(k-1)+'_mers'] + '_' + df_genes['Gene_non_or'].shift(-(k-1))

    (df_genes[['Specie_Scaffold', str(k) + '_mers']]
                                        .groupby(str(k) + '_mers', as_index = False)
                                        .agg(Specie_Scaffold=('Specie_Scaffold', list), Scaffolds=(str(k) + '_mers', 'count'))
                                        .to_csv('../Data/Intermediate/k_mers/Scaffold/' + str(k) + '_mers.csv', index=False))

    del df_genes[str(k-1) + '_mers']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_genes.loc[df_genes['Specie_Scaffold'].shift(-1) == df_genes['Specie_Scaffold'], '2_mers'] = df_genes['Gene_non_or'] + '_' + df_genes['Gene_non_or'].shift(-1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_genes.loc[df_genes['Specie_Scaffold'].shift(-(k-1)) == df_genes['Specie_Scaffold'], str(k) + '_mers'] = df_genes[str(k-1)+'_mers'] + '_' + df_genes['Gene_non_or'].shift(-(k-1))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See t