# Collecting Data

When we want to collect data in the shape of tables, the [**Pandas**](https://pandas.pydata.org/) library is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.

**Pandas** works with a data structure called a DataFrame: a two-dimensional table with labeled rows and columns, where each column can hold a different data type.

![dataframe](https://pandas.pydata.org/docs/_images/01_table_dataframe.svg)

To use **Pandas**, you first need to install the library with [pip](https://pypi.org/project/pip/):

```shell
# Command line
pip install pandas
```

```python
# Notebook
!pip install pandas
```

And then you can import the library with:

```python
import pandas as pd
```

Once the library is imported, you can create your own DataFrames (try running this example code):

```python
# From a Python list: the result has a single column, labelled 0
areas = ['Sustainable Chemicals', 'Natural Products', 'Microbial Foods']

df_areas = pd.DataFrame(areas)

# From a dictionary: keys become column names, values become the column data
research = {
    'areas': areas,
    'contact': ['Andreas Worberg', 'Ditte Hededam Welner', 'Morten Sommer'],
    'num_groups': [2, 4, 3],
    'description': [
        'An applied translational application program, focusing on the identification and pilot production of commercially viable metabolites in a microbial scaleable production strain.',
        'This program develops tools for yeast and actinomycete cell factories to produce any plant, fungal or actinobacterial natural product.',
        'We are rethinking food production from scratch, from a biosustainability point of view.',
    ],
}

df_research = pd.DataFrame(research)

# A notebook cell only displays its last expression,
# so print each preview explicitly to see both
print(df_areas.head())
print(df_research.head())
```
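
To get a feel for the rows-and-columns structure, the following minimal sketch inspects a small DataFrame. It reuses two of the columns from the example above under a hypothetical variable name (`df`); the attributes shown (`shape`, `columns`, `iloc`) are standard pandas, but the choice of what to inspect is only a suggestion.

```python
import pandas as pd

# Small illustrative table (values taken from the example above)
df = pd.DataFrame({
    "area": ["Sustainable Chemicals", "Natural Products", "Microbial Foods"],
    "num_groups": [2, 4, 3],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape

# Column labels as a plain Python list
column_names = list(df.columns)

# Selecting one column returns a Series, which supports aggregations
total_groups = int(df["num_groups"].sum())

# iloc selects rows by integer position
first_row = df.iloc[0]

print(n_rows, n_cols)
print(column_names)
print(total_groups)
print(first_row["area"])
```

`df.dtypes` and `df.info()` are also useful for checking which data type each column holds.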

## Loading Data with Pandas

![loading](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg)



Pandas provides the [**read_csv()**](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function to read data stored as a CSV/TSV file into a pandas DataFrame. Pandas supports many different file formats and data sources out of the box (CSV, Excel, SQL, JSON, Parquet, …), each with a reader function whose name starts with the prefix `read_`.

You can read from a local file:

```python
import pandas as pd

data = pd.read_csv("my_file.csv", sep=',')

data.head()

```

Or from a URL, for instance from the [PRIDE](https://www.ebi.ac.uk/pride) database:

```python
import pandas as pd

data = pd.read_csv("https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD041975/experimentalDesignTemplate.txt", sep='\t')

data.head()
```
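
If you want to experiment with the `sep` argument without a file or a network connection, a possible approach is to read from an in-memory text buffer. The sketch below uses made-up sample content (the `sample` and `intensity` columns are hypothetical) to show that the same table can be read from comma-separated and tab-separated text.

```python
import io

import pandas as pd

# Made-up sample content standing in for real files
csv_text = "sample,intensity\nA,1.5\nB,2.0\n"
tsv_text = "sample\tintensity\nA\t1.5\nB\t2.0\n"

# sep defaults to ',' so it can be omitted for CSV
from_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-separated text needs sep='\t'
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls should produce the same table
print(from_csv.equals(from_tsv))
print(from_csv.head())
```

Reading a tab-separated file without `sep='\t'` does not raise an error; it typically yields a single column containing the whole line, which is a common source of confusion.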


Some other data samples you can use:

**Excel**
https://res.cloudinary.com/opquast/raw/upload/checklists/OPQUAST-CHECKLIST-EN_2020.xls

**Zipped txt**
https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD041957txt.zip

**Json**
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pubmed.cgi/BioC_json/17299597/unicode
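
As a rough guide to the formats listed above, the sketch below shows one way to read zipped text and nested JSON. It builds its own tiny inputs (the file names, columns and records are all hypothetical) so that it runs offline; the structure of the real files linked above may differ and would need to be inspected first. Excel files can be read with `pd.read_excel()`, which requires an additional engine package to be installed (for example `xlrd` for `.xls` files), so it is not executed here.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create a zip archive holding a single tab-separated text file
    zip_path = Path(tmp) / "example.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("table.txt", "protein\tscore\nP1\t0.9\nP2\t0.4\n")

    # read_csv infers zip compression from the .zip extension;
    # this works when the archive contains exactly one data file
    zipped = pd.read_csv(zip_path, sep="\t")

# Nested JSON-like records (made up for illustration)
records = [
    {"id": 1, "meta": {"source": "demo"}},
    {"id": 2, "meta": {"source": "test"}},
]

# json_normalize flattens nested keys into dotted column names
flat = pd.json_normalize(records)

print(zipped)
print(flat)
```

For JSON that is already a flat list of records, `pd.read_json()` is usually enough; `pd.json_normalize()` helps when the records contain nested objects.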
