## Reading CSV and other files with Pandas
Pandas can read many file types, from basic ones such as CSV or JSON to others such as Excel or SQL.


In [7]:
import numpy as np
import pandas as pd

In this case the file is loaded with ***pd.read_csv*** (keeping in mind that Pandas can read other formats besides CSV).

*header* sets the row that holds the column names, *names* supplies the name to use for each column, *sep* indicates the character that separates one column from the next, and *index_col* selects the column used as the row index.
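
As a minimal sketch of how these parameters interact, the example below reads a small semicolon-separated table from memory instead of from disk. The data, the column names, and the `df_example` variable are hypothetical and only serve to illustrate the parameters described above.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for a file on disk.
raw = io.StringIO(
    "id;title;pay\n"
    "0;Data Scientist;70000\n"
    "1;Data Engineer;85000\n"
)

df_example = pd.read_csv(
    raw,
    sep=";",        # columns are separated by semicolons
    header=0,       # row 0 holds the original column names
    names=["row_id", "job_title", "salary"],  # replace them with our own names
    index_col=0,    # use the first column as the row index
)
print(df_example)
```

Passing `header=0` together with `names` tells Pandas that the first row is a header to be discarded in favor of the supplied names, rather than a row of data.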

In [8]:
df_salaries = pd.read_csv("ds_salaries.csv", index_col=0, sep=",")
df_salaries

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
...,...,...,...,...,...,...,...,...,...,...,...
602,2022,SE,FT,Data Engineer,154000,USD,154000,US,100,US,M
603,2022,SE,FT,Data Engineer,126000,USD,126000,US,100,US,M
604,2022,SE,FT,Data Analyst,129000,USD,129000,US,0,US,M
605,2022,SE,FT,Data Analyst,150000,USD,150000,US,100,US,M


The ***info*** method gives an overview of the data (column names, non-null counts, and dtypes), which can suggest where to start cleaning it.

In [22]:
df_salaries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 607 entries, 0 to 606
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           607 non-null    int64 
 1   experience_level    607 non-null    object
 2   employment_type     607 non-null    object
 3   job_title           607 non-null    object
 4   salary              607 non-null    int64 
 5   salary_currency     607 non-null    object
 6   salary_in_usd       607 non-null    int64 
 7   employee_residence  607 non-null    object
 8   remote_ratio        607 non-null    int64 
 9   company_location    607 non-null    object
 10  company_size        607 non-null    object
dtypes: int64(4), object(7)
memory usage: 56.9+ KB
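
The salaries dataset has no missing values, so the overview above shows 607 non-null entries in every column. To illustrate what the method reveals when data is incomplete, here is a small sketch on a hypothetical DataFrame (`df_small` is not part of the original notebook) that contains gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df_small = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer", None],
    "salary": [70000, np.nan, 85000],
})

# The non-null counts are now lower than the number of rows.
df_small.info()

# Count the missing values per column to decide what needs cleaning.
missing_per_column = df_small.isna().sum()
print(missing_per_column)
```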


### Examples

In [11]:
np.max(df_salaries,axis=0)

work_year                             2022
experience_level                        SE
employment_type                         PT
job_title             Staff Data Scientist
salary                            30400000
salary_currency                        USD
salary_in_usd                       600000
employee_residence                      VN
remote_ratio                           100
company_location                        VN
company_size                             S
dtype: object
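
In the result above, the maximum of a text column is the value that sorts last alphabetically, which is rarely what one wants. As a hedged alternative, the sketch below uses the DataFrame's own `max` method with `numeric_only=True` on a hypothetical `df_small` to restrict the result to numeric columns.

```python
import pandas as pd

# Hypothetical sample with one text column and two numeric columns.
df_small = pd.DataFrame({
    "job_title": ["Data Scientist", "AI Scientist"],
    "salary": [70000, 120000],
    "remote_ratio": [0, 100],
})

# Without the flag, text columns are compared alphabetically.
all_max = df_small.max()

# With numeric_only=True, only the numeric columns are kept.
numeric_max = df_small.max(numeric_only=True)
print(numeric_max)
```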

In [17]:
df_salaries[df_salaries["job_title"]=="AI Scientist"]

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
52,2020,EN,FT,AI Scientist,300000,DKK,45896,DK,50,DK,S
96,2021,EN,PT,AI Scientist,12000,USD,12000,BR,100,US,S
113,2021,EN,PT,AI Scientist,12000,USD,12000,PK,100,US,M
244,2021,EN,FT,AI Scientist,1335000,INR,18053,IN,100,AS,S
277,2021,SE,FT,AI Scientist,55000,USD,55000,ES,100,ES,L
391,2022,MI,FT,AI Scientist,120000,USD,120000,US,0,US,M
606,2022,MI,FT,AI Scientist,200000,USD,200000,IN,100,US,L


In [20]:
np.unique(df_salaries["experience_level"])

array(['EN', 'EX', 'MI', 'SE'], dtype=object)
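
`np.unique` returns the distinct values in sorted order. Pandas offers closely related methods on a Series; the sketch below, using a hypothetical `levels` Series, shows `unique` (which keeps the order of first appearance) and `value_counts` (which also reports how often each value occurs).

```python
import pandas as pd

# Hypothetical sample of experience levels.
levels = pd.Series(["MI", "SE", "SE", "EN", "MI", "SE"], name="experience_level")

# Distinct values in order of first appearance (not sorted).
distinct = levels.unique()
print(distinct)

# Number of rows for each distinct value, most frequent first.
counts = levels.value_counts()
print(counts)
```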

In [21]:
np.unique(df_salaries["job_title"])

array(['3D Computer Vision Researcher', 'AI Scientist',
       'Analytics Engineer', 'Applied Data Scientist',
       'Applied Machine Learning Scientist', 'BI Data Analyst',
       'Big Data Architect', 'Big Data Engineer', 'Business Data Analyst',
       'Cloud Data Engineer', 'Computer Vision Engineer',
       'Computer Vision Software Engineer', 'Data Analyst',
       'Data Analytics Engineer', 'Data Analytics Lead',
       'Data Analytics Manager', 'Data Architect', 'Data Engineer',
       'Data Engineering Manager', 'Data Science Consultant',
       'Data Science Engineer', 'Data Science Manager', 'Data Scientist',
       'Data Specialist', 'Director of Data Engineering',
       'Director of Data Science', 'ETL Developer',
       'Finance Data Analyst', 'Financial Data Analyst', 'Head of Data',
       'Head of Data Science', 'Head of Machine Learning',
       'Lead Data Analyst', 'Lead Data Engineer', 'Lead Data Scientist',
       'Lead Machine Learning Engineer', 'ML Engineer',


## .loc
Selects data by label.


In [23]:

df_salaries.loc[:]


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
...,...,...,...,...,...,...,...,...,...,...,...
602,2022,SE,FT,Data Engineer,154000,USD,154000,US,100,US,M
603,2022,SE,FT,Data Engineer,126000,USD,126000,US,100,US,M
604,2022,SE,FT,Data Analyst,129000,USD,129000,US,0,US,M
605,2022,SE,FT,Data Analyst,150000,USD,150000,US,100,US,M



Show a range of rows. With .loc, both the start and the end labels are included.


In [24]:

df_salaries.loc[0:4] 


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
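
Because the index of `df_salaries` is numeric, it is easy to mistake this for positional slicing. The sketch below uses a hypothetical `df_small` with text labels to make it clearer that `.loc` slices by label and includes the end label.

```python
import pandas as pd

# Hypothetical data indexed by text labels instead of numbers.
df_small = pd.DataFrame(
    {"salary": [70000, 260000, 85000, 20000]},
    index=["a", "b", "c", "d"],
)

# Label-based slice: rows "a", "b" and "c"; the end label is included.
subset = df_small.loc["a":"c"]
print(subset)
```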


Filtering by rows and columns


In [25]:
df_salaries.loc[0:4, ['job_title', 'salary']] 


Unnamed: 0,job_title,salary
0,Data Scientist,70000
1,Machine Learning Scientist,260000
2,Big Data Engineer,85000
3,Product Data Analyst,20000
4,Machine Learning Engineer,150000


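This selects the rows labeled 0 through 4 and the columns job_title and salary.

We can transform the values of a specific column of the DataFrame. The expression returns a new DataFrame; df_salaries itself is not changed unless the result is assigned back.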


In [26]:
df_salaries.loc[:, ['salary']] * -1


Unnamed: 0,salary
0,-70000
1,-260000
2,-85000
3,-20000
4,-150000
...,...
602,-154000
603,-126000
604,-129000
605,-150000


This multiplies every value in the salary column by -1.

Test which values meet a given condition
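
The expression in the cell above only computes a result; it does not store it. A minimal sketch of the difference, using a hypothetical `df_small`, assuming the goal is to actually update the column:

```python
import pandas as pd

# Hypothetical sample data.
df_small = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer"],
    "salary": [70000, 85000],
})

# This builds a new DataFrame; df_small keeps its original values.
negated = df_small.loc[:, ["salary"]] * -1
salary_before_assignment = df_small["salary"].tolist()

# Assigning through .loc writes the new values into df_small.
df_small.loc[:, "salary"] = df_small["salary"] * -1
print(df_small)
```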


In [27]:
df_salaries.loc[:, ['job_title']] == 'Data Scientist' 


Unnamed: 0,job_title
0,True
1,False
2,False
3,False
4,False
...,...
602,False
603,False
604,False
605,False


This shows the job_title column with True for the values that meet the condition and False for those that do not.
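
The comparison above only produces True/False values. To keep just the matching rows, the boolean result can be passed back to `.loc` as a row selector. The sketch below assumes a hypothetical `df_small`; note that the mask is built from a single column (a Series), not from a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data.
df_small = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer", "Data Scientist"],
    "salary": [70000, 85000, 120000],
})

# Boolean Series: True where the condition holds.
mask = df_small["job_title"] == "Data Scientist"

# Use the mask to select rows, and a list of names to select columns.
data_scientists = df_small.loc[mask, ["job_title", "salary"]]
print(data_scientists)
```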


## .iloc
Selects data by integer position.


In [29]:

df_salaries.iloc[:]  # shows every row and column of the DataFrame


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
...,...,...,...,...,...,...,...,...,...,...,...
602,2022,SE,FT,Data Engineer,154000,USD,154000,US,100,US,M
603,2022,SE,FT,Data Engineer,126000,USD,126000,US,100,US,M
604,2022,SE,FT,Data Analyst,129000,USD,129000,US,0,US,M
605,2022,SE,FT,Data Analyst,150000,USD,150000,US,100,US,M


Select data by row and column positions


In [31]:
df_salaries.iloc[:4, 2:6]  # rows at positions 0 to 3 and columns at positions 2 to 5 (the end of the slice is excluded)


Unnamed: 0,employment_type,job_title,salary,salary_currency
0,FT,Data Scientist,70000,EUR
1,FT,Machine Learning Scientist,260000,USD
2,FT,Big Data Engineer,85000,GBP
3,FT,Product Data Analyst,20000,USD


Look up a single value.


In [32]:
df_salaries.iloc[1, 3]  # value at row position 1, column position 3 (both zero-based)

'Machine Learning Scientist'
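
In `df_salaries` the row labels happen to match the row positions, so `.loc` and `.iloc` often look interchangeable. The sketch below uses a hypothetical `df_small` whose labels differ from its positions to show where the two diverge.

```python
import pandas as pd

# Hypothetical data whose row labels (10, 20, 30) differ from positions (0, 1, 2).
df_small = pd.DataFrame(
    {
        "job_title": ["Data Scientist", "Data Engineer", "Data Analyst"],
        "salary": [70000, 85000, 60000],
    },
    index=[10, 20, 30],
)

# .iloc counts positions: this is the second row, whatever its label.
title_by_position = df_small.iloc[1, 0]

# .loc looks up labels: this is the row labeled 20 and the column named salary.
salary_by_label = df_small.loc[20, "salary"]

print(title_by_position, salary_by_label)
```

Here `df_small.loc[1]` would raise a `KeyError`, because no row carries the label 1.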