**Tidy data** is a framework to structure data sets, so they can be easily analyzed and visualized
+ what is tidy data?
+ each row is an observation
+ each column is a variable
+ each type of observational unit forms a table

**Columns Contain Values, not Variables**

In [1]:
import pandas as pd
pew = pd.read_csv(r'/Users/BrendanErhard/Desktop/Python/Datasets/danielchen/pew.csv')

In [2]:
# when we look at this data set, we can see that not every column is a variable
# the values that relate to income are spread across multiple columns
# the format shown is a great choice when presenting data in a table
# but for data analytics, the table needs to be reshaped so that we have religion, income and count variables

# show only the first few columns
print(pew.iloc[:, 0:6])

                   religion  <$10k  $10-20k  $20-30k  $30-40k  $40-50k
0                  Agnostic     27       34       60       81       76
1                   Atheist     12       27       37       52       35
2                  Buddhist     27       21       30       34       33
3                  Catholic    418      617      732      670      638
4        Don’t know/refused     15       14       15       11       10
5          Evangelical Prot    575      869     1064      982      881
6                     Hindu      1        9        7        9       11
7   Historically Black Prot    228      244      236      238      197
8         Jehovah's Witness     20       27       24       24       21
9                    Jewish     19       19       25       25       30
10            Mainline Prot    289      495      619      655      651
11                   Mormon     29       40       48       51       56
12                   Muslim      6        7        9       10        9
13    

In [3]:
# the above view is known as 'wide' data
# to turn the data into 'long' tidy format, we have to unpivot/melt/gather the df
# pandas has a function called [melt] that will reshape the df into tidy format
    # id_vars = a container (list, tuple or ndarray) of the variables that will remain as is
    # value_vars = the columns to melt down (unpivot); defaults to every column not in id_vars
    # var_name = a string for the new column name that holds the melted column names
    # value_name = a string for the new column name that holds the values for the var_name

# do not need to specify value_vars below, since the goal is to unpivot all columns except religion

pew_long = pd.melt(pew, id_vars = 'religion')
print(pew_long.head())

             religion variable  value
0            Agnostic    <$10k     27
1             Atheist    <$10k     12
2            Buddhist    <$10k     27
3            Catholic    <$10k    418
4  Don’t know/refused    <$10k     15


In [4]:
print(pew_long.tail())


                  religion            variable  value
175               Orthodox  Don't know/refused     73
176        Other Christian  Don't know/refused     18
177           Other Faiths  Don't know/refused     71
178  Other World Religions  Don't know/refused      8
179           Unaffiliated  Don't know/refused    597


In [5]:
# can change the defaults so that the melted/unpivoted columns are named
pew_long = pd.melt(pew, id_vars='religion', var_name='income', value_name='count')
print(pew_long.head())

             religion income  count
0            Agnostic  <$10k     27
1             Atheist  <$10k     12
2            Buddhist  <$10k     27
3            Catholic  <$10k    418
4  Don’t know/refused  <$10k     15


In [6]:
print(pew_long.tail())

                  religion              income  count
175               Orthodox  Don't know/refused     73
176        Other Christian  Don't know/refused     18
177           Other Faiths  Don't know/refused     71
178  Other World Religions  Don't know/refused      8
179           Unaffiliated  Don't know/refused    597
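
The cells above never pass `value_vars`, so every non-id column gets melted. A minimal sketch of melting only some columns, assuming a made-up miniature of the pew layout (the frame `wide` and its counts are hypothetical, not the real data):

```python
import pandas as pd

# hypothetical miniature of the pew layout (made-up counts, not the real data)
wide = pd.DataFrame({
    'religion': ['A', 'B'],
    '<$10k': [1, 2],
    '$10-20k': [3, 4],
    '$20-30k': [5, 6],
})

# melt only two of the income columns; '$20-30k' is left out of the result
long = pd.melt(
    wide,
    id_vars='religion',
    value_vars=['<$10k', '$10-20k'],
    var_name='income',
    value_name='count',
)
print(long)
```

Columns that are in neither `id_vars` nor `value_vars` are dropped from the melted result, which is worth keeping in mind when only a subset is listed.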


**Keep Multiple Columns Fixed**
+ not every data set will have 1 column to hold still, while you unpivot the rest of the columns

In [7]:
billboard = pd.read_csv(r'/Users/BrendanErhard/Desktop/Python/Datasets/danielchen/billboard.csv')

# look at the first few rows and columns
print(billboard.iloc[0:5, 0:16])

   year        artist                    track  time date.entered  wk1   wk2  \
0  2000         2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26   87  82.0   
1  2000       2Ge+her  The Hardest Part Of ...  3:15   2000-09-02   91  87.0   
2  2000  3 Doors Down               Kryptonite  3:53   2000-04-08   81  70.0   
3  2000  3 Doors Down                    Loser  4:24   2000-10-21   76  76.0   
4  2000      504 Boyz            Wobble Wobble  3:35   2000-04-15   57  34.0   

    wk3   wk4   wk5   wk6   wk7   wk8   wk9  wk10  wk11  
0  72.0  77.0  87.0  94.0  99.0   NaN   NaN   NaN   NaN  
1  92.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
2  68.0  67.0  66.0  57.0  54.0  53.0  51.0  51.0  51.0  
3  72.0  69.0  67.0  65.0  55.0  59.0  62.0  61.0  61.0  
4  25.0  17.0  17.0  31.0  36.0  49.0  53.0  57.0  64.0  


In [8]:
# can see that each week has its own column
# there may be times when the data needs to be melted - ex, to create a facet plot of the weekly ratings

billboard_long = pd.melt(billboard, id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
                        var_name = 'week',
                        value_name = 'rating')
print(billboard_long.head())

   year        artist                    track  time date.entered week  rating
0  2000         2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk1    87.0
1  2000       2Ge+her  The Hardest Part Of ...  3:15   2000-09-02  wk1    91.0
2  2000  3 Doors Down               Kryptonite  3:53   2000-04-08  wk1    81.0
3  2000  3 Doors Down                    Loser  4:24   2000-10-21  wk1    76.0
4  2000      504 Boyz            Wobble Wobble  3:35   2000-04-15  wk1    57.0


In [9]:
print(billboard_long.tail())

       year            artist                    track  time date.entered  \
24087  2000       Yankee Grey     Another Nine Minutes  3:10   2000-04-29   
24088  2000  Yearwood, Trisha          Real Live Woman  3:55   2000-04-01   
24089  2000   Ying Yang Twins  Whistle While You Tw...  4:19   2000-03-18   
24090  2000     Zombie Nation            Kernkraft 400  3:30   2000-09-02   
24091  2000   matchbox twenty                     Bent  4:12   2000-04-29   

       week  rating  
24087  wk76     NaN  
24088  wk76     NaN  
24089  wk76     NaN  
24090  wk76     NaN  
24091  wk76     NaN  
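
The tail above shows many rows with no rating, and the week is stored as text ('wk76'). Before plotting weekly ratings it may help to drop the empty rows and turn the week into a number. A rough sketch on a made-up miniature of `billboard_long` (the frame `mini_long`, its rows, and the `week_num` column are assumptions for illustration):

```python
import pandas as pd

# hypothetical miniature of billboard_long (made-up rows)
mini_long = pd.DataFrame({
    'track': ['Song A', 'Song B', 'Song A', 'Song B'],
    'week': ['wk1', 'wk1', 'wk2', 'wk2'],
    'rating': [87.0, 91.0, 82.0, None],
})

# remove the rows that have no rating
mini_clean = mini_long.dropna(subset=['rating']).copy()

# 'wk2' -> 2, so the weeks sort numerically rather than alphabetically
mini_clean['week_num'] = (
    mini_clean['week'].str.replace('wk', '', regex=False).astype(int)
)
print(mini_clean)
```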


**Columns Contain Multiple Variables**
+ sometimes columns in a data set may represent multiple variables - common in healthcare data

In [10]:
ebola = pd.read_csv(r'/Users/BrendanErhard/Desktop/Python/Datasets/danielchen/country_timeseries.csv')
print(ebola.columns)

Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
       'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain',
       'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
       'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
       'Deaths_Spain', 'Deaths_Mali'],
      dtype='object')


In [11]:
# print select rows

print(ebola.iloc[:5, [0, 1, 2, 3, 10, 11]])

         Date  Day  Cases_Guinea  Cases_Liberia  Deaths_Guinea  Deaths_Liberia
0    1/5/2015  289        2776.0            NaN         1786.0             NaN
1    1/4/2015  288        2775.0            NaN         1781.0             NaN
2    1/3/2015  287        2769.0         8166.0         1767.0          3496.0
3    1/2/2015  286           NaN         8157.0            NaN          3496.0
4  12/31/2014  284        2730.0         8115.0         1739.0          3471.0


In [12]:
# the column names 'Cases_Guinea' and 'Deaths_Guinea' actually contain 2 variables:
# the individual status (cases or deaths), as well as the country name, Guinea
ebola_long = pd.melt(ebola, id_vars=['Date', 'Day'])
print(ebola_long.head())

         Date  Day      variable   value
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0


In [13]:
print(ebola_long.tail())

           Date  Day     variable  value
1947  3/27/2014    5  Deaths_Mali    NaN
1948  3/26/2014    4  Deaths_Mali    NaN
1949  3/25/2014    3  Deaths_Mali    NaN
1950  3/24/2014    2  Deaths_Mali    NaN
1951  3/22/2014    0  Deaths_Mali    NaN


**Split and Add Columns Individually - Simple Method**
+ conceptually, the column of interest can be split based on the underscore in the column name, '_'
+ the 1st part will be the new status column, the 2nd part will be the new country column - this will require some string parsing and splitting
+ will use the 'split' method, which takes the string and splits it based on a given delimiter
+ by default, split will split the string on whitespace, but the underscore can be passed in for this example
+ to get access to the string methods, will need to use the 'str' accessor

In [14]:
# get the variable column
# access the string method
# split the column based on a delimiter

variable_split = ebola_long.variable.str.split('_')
print(variable_split[:5])

0    [Cases, Guinea]
1    [Cases, Guinea]
2    [Cases, Guinea]
3    [Cases, Guinea]
4    [Cases, Guinea]
Name: variable, dtype: object


In [15]:
print(variable_split[-5:])

1947    [Deaths, Mali]
1948    [Deaths, Mali]
1949    [Deaths, Mali]
1950    [Deaths, Mali]
1951    [Deaths, Mali]
Name: variable, dtype: object


In [16]:
# after splitting on the underscore, the values are returned in a list
# the visual cue is that the results are surrounded by square brackets

# the entire container
print(type(variable_split))

<class 'pandas.core.series.Series'>


In [17]:
# the first element in the container
print(type(variable_split[0]))

<class 'list'>


In [18]:
# the next step is to assign the pieces to a new column
# first, need to extract all the 0-index elements for the status column, and the 1-index elements for the country column
# to do so will need to access the string methods again
# also need to use the [get] method to get the index wanted for each row
status_values = variable_split.str.get(0)
country_values = variable_split.str.get(1)

print(status_values[:5])

0    Cases
1    Cases
2    Cases
3    Cases
4    Cases
Name: variable, dtype: object


In [19]:
print(status_values[-5:])

1947    Deaths
1948    Deaths
1949    Deaths
1950    Deaths
1951    Deaths
Name: variable, dtype: object


In [20]:
print(country_values[:5])

0    Guinea
1    Guinea
2    Guinea
3    Guinea
4    Guinea
Name: variable, dtype: object


In [21]:
print(country_values[-5:])

1947    Mali
1948    Mali
1949    Mali
1950    Mali
1951    Mali
Name: variable, dtype: object


In [22]:
# now that we have the vectors we want, can add them to the df
ebola_long['status'] = status_values
ebola_long['country'] = country_values
print(ebola_long.head())

         Date  Day      variable   value status country
0    1/5/2015  289  Cases_Guinea  2776.0  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  Cases  Guinea


**Split and combine in a single step - simple method**
+ will exploit the fact that the vector returned is in the same order as the data

In [23]:
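# note: ebola_long already has status and country columns from In [22],
# so both columns appear twice in the concatenated output below
variable_split = ebola_long.variable.str.split('_', expand=True)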
variable_split.columns = ['status', 'country']
ebola_parsed = pd.concat([ebola_long, variable_split], axis=1)
print(ebola_parsed.head())

         Date  Day      variable   value status country status country
0    1/5/2015  289  Cases_Guinea  2776.0  Cases  Guinea  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  Cases  Guinea  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  Cases  Guinea  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  Cases  Guinea  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  Cases  Guinea  Cases  Guinea


In [24]:
print(ebola_parsed.tail())

           Date  Day     variable  value  status country  status country
1947  3/27/2014    5  Deaths_Mali    NaN  Deaths    Mali  Deaths    Mali
1948  3/26/2014    4  Deaths_Mali    NaN  Deaths    Mali  Deaths    Mali
1949  3/25/2014    3  Deaths_Mali    NaN  Deaths    Mali  Deaths    Mali
1950  3/24/2014    2  Deaths_Mali    NaN  Deaths    Mali  Deaths    Mali
1951  3/22/2014    0  Deaths_Mali    NaN  Deaths    Mali  Deaths    Mali
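
The status and country columns show up twice above because `ebola_long` had already been given those columns before the concat. A small sketch of the same split-and-concat on a freshly melted frame, where no duplicates appear (the frame `mini` and its values are made up for illustration):

```python
import pandas as pd

# hypothetical miniature of ebola_long right after melting (made-up values)
mini = pd.DataFrame({
    'Day': [289, 288],
    'variable': ['Cases_Guinea', 'Deaths_Mali'],
    'value': [2776.0, 1786.0],
})

# expand=True returns a df with one column per piece instead of a Series of lists
parts = mini['variable'].str.split('_', expand=True)
parts.columns = ['status', 'country']

# concatenate column-wise; rows line up because both objects share the same index
mini_parsed = pd.concat([mini, parts], axis=1)
print(mini_parsed)
```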


**Split and Combine in a single step - more complicated**
+ the vector returned is in the same order as the data
+ will concatenate the new vector to the original data
   + but also, the split returns a list of 2 elements, where each element is a new column
   + can combine the list of split items with the built-in [zip] function
+ zip takes a set of iterables (ex: lists, tuples), and creates a new container that is made of the input iterables
+ BUT, each new container created holds the elements that share the same index in the input containers

In [25]:
# zip example
constants = ['pi', 'e']
values = ['3.14', '2.718']

# can zip the values together
# but have to call the list function, to show the contents of the zip object

print(list(zip(constants, values)))

[('pi', '3.14'), ('e', '2.718')]


In [26]:
# each element now has the constant matched with its corresponding value
# conceptually, each container is like one side of a zipper
# when the containers are zipped, the indices are matched and returned
# in python, the asterisk operator * is used to unpack containers

ebola_long['status'], ebola_long['country'] = zip(*ebola_long.variable.str.split('_'))
print(ebola_long.head())

         Date  Day      variable   value status country
0    1/5/2015  289  Cases_Guinea  2776.0  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  Cases  Guinea
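
The `zip(*...)` call above can be hard to read at first. A minimal sketch of what the unpacking does, using a plain list of lists in place of the split Series (the list `pairs` is made up for illustration):

```python
# each inner list plays the role of one row's split result
pairs = [['Cases', 'Guinea'], ['Cases', 'Liberia'], ['Deaths', 'Mali']]

# *pairs passes each inner list to zip as a separate argument,
# so zip groups all the first items together and all the second items together
status, country = zip(*pairs)

print(status)   # ('Cases', 'Cases', 'Deaths')
print(country)  # ('Guinea', 'Liberia', 'Mali')
```

This only works when every split produces exactly 2 pieces; a value with an extra underscore would give 3 groups and the 2-name assignment would fail.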


**Variables in Both Rows and Columns**
+ at times data will be formatted so that variables are in both rows and columns
+ below will see what happens if a column of data holds 2 variables, instead of 1
+ will have to pivot, or cast, the variable into separate columns

In [27]:
weather = pd.read_csv(r'/Users/BrendanErhard/Desktop/Python/Datasets/danielchen/weather.csv')
print(weather.iloc[:5, :11])

        id  year  month element  d1    d2    d3  d4    d5  d6  d7
0  MX17004  2010      1    tmax NaN   NaN   NaN NaN   NaN NaN NaN
1  MX17004  2010      1    tmin NaN   NaN   NaN NaN   NaN NaN NaN
2  MX17004  2010      2    tmax NaN  27.3  24.1 NaN   NaN NaN NaN
3  MX17004  2010      2    tmin NaN  14.4  14.4 NaN   NaN NaN NaN
4  MX17004  2010      3    tmax NaN   NaN   NaN NaN  32.1 NaN NaN


In [28]:
# the weather data include minimum and maximum temperatures (tmin and tmax)
# the element column contains variables that need to be cast/pivoted into new columns
# the day variables need to be melted into row values

weather_melt = pd.melt(weather, id_vars=['id', 'year', 'month', 'element'],
                      var_name = 'day',
                      value_name = 'temp')
print(weather_melt.head())

        id  year  month element day  temp
0  MX17004  2010      1    tmax  d1   NaN
1  MX17004  2010      1    tmin  d1   NaN
2  MX17004  2010      2    tmax  d1   NaN
3  MX17004  2010      2    tmin  d1   NaN
4  MX17004  2010      3    tmax  d1   NaN


In [29]:
print(weather_melt.tail())

          id  year  month element  day  temp
677  MX17004  2010     10    tmin  d31   NaN
678  MX17004  2010     11    tmax  d31   NaN
679  MX17004  2010     11    tmin  d31   NaN
680  MX17004  2010     12    tmax  d31   NaN
681  MX17004  2010     12    tmin  d31   NaN


In [30]:
# next, need to pivot the variables stored in the element column - this is referred to as casting
# here melt is called as a pandas function (pd.melt), while pivot_table is called as a df method
# pandas provides both forms for each (df.melt and pd.pivot_table)

weather_tidy = weather_melt.pivot_table(index=['id', 'year', 'month', 'day'],
                                       columns='element',
                                       values = 'temp')
weather_tidy.head(8)

element                 tmax  tmin
id      year month day            
MX17004 2010 1     d30  27.8  14.5
             2     d11  29.7  13.4
                   d2   27.3  14.4
                   d23  29.9  10.7
                   d3   24.1  14.4
             3     d10  34.5  16.8
                   d16  31.1  17.6
                   d5   32.1  14.2


In [31]:
# can leave the table in its current state
# but can also flatten the hierarchical index into regular columns

weather_tidy_flat = weather_tidy.reset_index()
print(weather_tidy_flat.head())

element       id  year  month  day  tmax  tmin
0        MX17004  2010      1  d30  27.8  14.5
1        MX17004  2010      2  d11  29.7  13.4
2        MX17004  2010      2   d2  27.3  14.4
3        MX17004  2010      2  d23  29.9  10.7
4        MX17004  2010      2   d3  24.1  14.4


In [32]:
# can also apply these methods, without the intermediate df
weather_tidy = weather_melt.pivot_table(index=['id', 'year', 'month', 'day'],
                                       columns='element',
                                       values='temp').reset_index()
print(weather_tidy.head())

element       id  year  month  day  tmax  tmin
0        MX17004  2010      1  d30  27.8  14.5
1        MX17004  2010      2  d11  29.7  13.4
2        MX17004  2010      2   d2  27.3  14.4
3        MX17004  2010      2  d23  29.9  10.7
4        MX17004  2010      2   d3  24.1  14.4
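
The word `element` in the printed header above is the name of the column index left over from the pivot, not a data column. A small sketch of clearing it, on a made-up miniature of `weather_melt` (the frame `mini_melt` and its values are assumptions for illustration):

```python
import pandas as pd

# hypothetical miniature of weather_melt (made-up values)
mini_melt = pd.DataFrame({
    'month': [1, 1, 2, 2],
    'day': ['d1', 'd1', 'd1', 'd1'],
    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
    'temp': [27.8, 14.5, 29.7, 13.4],
})

# cast the element values into their own columns, then flatten the index
mini_tidy = mini_melt.pivot_table(
    index=['month', 'day'],
    columns='element',
    values='temp',
).reset_index()

# drop the leftover 'element' label from the column index
mini_tidy.columns.name = None
print(mini_tidy)
```

Note that `pivot_table` aggregates with the mean by default, so if the same index/element combination appeared more than once, the values would be averaged rather than kept separately.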


**Multiple observational units in a table - normalization**


In [33]:
print(billboard_long.head())

   year        artist                    track  time date.entered week  rating
0  2000         2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk1    87.0
1  2000       2Ge+her  The Hardest Part Of ...  3:15   2000-09-02  wk1    91.0
2  2000  3 Doors Down               Kryptonite  3:53   2000-04-08  wk1    81.0
3  2000  3 Doors Down                    Loser  4:24   2000-10-21  wk1    76.0
4  2000      504 Boyz            Wobble Wobble  3:35   2000-04-15  wk1    57.0


In [34]:
# now subset the data based on a particular track
print(billboard_long[billboard_long.track == 'Loser'].head())

      year        artist  track  time date.entered week  rating
3     2000  3 Doors Down  Loser  4:24   2000-10-21  wk1    76.0
320   2000  3 Doors Down  Loser  4:24   2000-10-21  wk2    76.0
637   2000  3 Doors Down  Loser  4:24   2000-10-21  wk3    72.0
954   2000  3 Doors Down  Loser  4:24   2000-10-21  wk4    69.0
1271  2000  3 Doors Down  Loser  4:24   2000-10-21  wk5    67.0


In [35]:
# it would be better to store the track information in a separate table
# this way, the info stored in the [year], [artist], [track], and [time] columns would not be repeated
# should place the year, artist, track, time and date.entered in a new df (the code below keeps only the first four)
# each unique set of values is then assigned a unique ID
# can be thought of as reversing the concat and merge steps from Ch 4

billboard_songs = billboard_long[['year', 'artist', 'track', 'time']]
print(billboard_songs.shape)

(24092, 4)


In [36]:
# need to drop the duplicate rows
billboard_songs = billboard_songs.drop_duplicates()
print(billboard_songs.shape)

(317, 4)


In [37]:
# can then assign a unique value to each row of data
billboard_songs['id'] = range(len(billboard_songs))
print(billboard_songs.head(n=10))

   year          artist                    track  time  id
0  2000           2 Pac  Baby Don't Cry (Keep...  4:22   0
1  2000         2Ge+her  The Hardest Part Of ...  3:15   1
2  2000    3 Doors Down               Kryptonite  3:53   2
3  2000    3 Doors Down                    Loser  4:24   3
4  2000        504 Boyz            Wobble Wobble  3:35   4
5  2000            98^0  Give Me Just One Nig...  3:24   5
6  2000         A*Teens            Dancing Queen  3:44   6
7  2000         Aaliyah            I Don't Wanna  4:15   7
8  2000         Aaliyah                Try Again  4:03   8
9  2000  Adams, Yolanda            Open My Heart  5:30   9


In [38]:
# now can use the newly created id column to match a song to its weekly rating
# merge the song df to the original data set

billboard_ratings = billboard_long.merge(billboard_songs, on=['year', 'artist', 'track', 'time'])
print(billboard_ratings.head())

   year artist                    track  time date.entered week  rating  id
0  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk1    87.0   0
1  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk2    82.0   0
2  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk3    72.0   0
3  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk4    77.0   0
4  2000  2 Pac  Baby Don't Cry (Keep...  4:22   2000-02-26  wk5    87.0   0


In [39]:
# finally can subset the columns 
billboard_ratings = billboard_ratings[['id', 'date.entered', 'week', 'rating']]
print(billboard_ratings.head())

   id date.entered week  rating
0   0   2000-02-26  wk1    87.0
1   0   2000-02-26  wk2    82.0
2   0   2000-02-26  wk3    72.0
3   0   2000-02-26  wk4    77.0
4   0   2000-02-26  wk5    87.0
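
One way to check that nothing was lost in the normalization is to merge the two tables back together on the id. A rough sketch on a made-up miniature of the billboard data (the frames `mini_long`, `songs`, `ratings`, `restored` and their rows are hypothetical):

```python
import pandas as pd

# hypothetical miniature of billboard_long (made-up rows)
mini_long = pd.DataFrame({
    'artist': ['Artist X', 'Artist X', 'Artist Y'],
    'track': ['Song A', 'Song A', 'Song B'],
    'week': ['wk1', 'wk2', 'wk1'],
    'rating': [87.0, 82.0, 81.0],
})

# songs table: one row per unique song, each with its own id
songs = mini_long[['artist', 'track']].drop_duplicates().reset_index(drop=True)
songs['id'] = range(len(songs))

# ratings table: only the id plus the weekly values
ratings = mini_long.merge(songs, on=['artist', 'track'])[['id', 'week', 'rating']]

# merging on id brings the song details back
restored = ratings.merge(songs, on='id')[['artist', 'track', 'week', 'rating']]
print(restored)
```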


**Observational units across multiple tables**


In [49]:
# the last bit of data tidying relates to the situation in which the same type of data is spread across multiple data sets
# data might be split across files to keep each file small
# or separate files might reflect the data collection process
# this section focuses on techniques for quickly loading multiple data sources and assembling them together

# the file below contains a list of URLs, where each URL is the download link to a part of the data
# begin by opening and reading the file; then iterate through each line of the file
# the code downloads only the first 5 data sets
# and uses string manipulation to build the local file name

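import os
import urllib.request  # importing only `urllib` does not reliably expose urllib.request

# code to download the data
# download only the first 5 data sets from the list of files
# note: the target folder ('../data') must already exist,
# otherwise urlretrieve raises the FileNotFoundError shown below

with open(r'/Users/BrendanErhard/Desktop/Python/Pandas Chapter 6/data/raw_data_urls.txt', 'r') as data_urls:
    for line, url in enumerate(data_urls):
        if line == 5:
            break
        fn = url.split('/')[-1].strip()
        fp = os.path.join('..', 'data', fn)
        print(url)
        print(fp)
        # strip the trailing newline before requesting the URL
        urllib.request.urlretrieve(url.strip(), fp)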

https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv

../data/fhv_tripdata_2015-01.csv


FileNotFoundError: [Errno 2] No such file or directory: '../data/fhv_tripdata_2015-01.csv'
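
The FileNotFoundError above points at the local path, which suggests the target folder did not exist when the download ran. A minimal sketch of building the local path and creating the folder first (the helper name `local_path_for` is hypothetical, and a throwaway temp folder stands in for '../data'; no download is attempted here):

```python
import os
import tempfile


def local_path_for(url, out_dir):
    """Return where `url` should be saved, creating `out_dir` if needed."""
    os.makedirs(out_dir, exist_ok=True)
    # strip the trailing newline that comes from reading the URL out of a text file
    filename = url.strip().split('/')[-1]
    return os.path.join(out_dir, filename)


# demonstration in a throwaway directory
base = tempfile.mkdtemp()
example_url = 'https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv\n'
fp = local_path_for(example_url, os.path.join(base, 'data'))
print(fp)

# a call such as urllib.request.urlretrieve(example_url.strip(), fp)
# would then have an existing folder to write into
```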

In [52]:
# can use a simple pattern matching function from glob library, to get a list of all the filenames that match a pattern
import glob

# get a list of the csv files from the nyc-taxi data folder
nyc_taxi_data = glob.glob('../data/fhv_*')
print(nyc_taxi_data)

[]


In [53]:
# load each file into a df
# concatenate the df together

# taxi = pd.concat  (etc)

**Loading multiple files using a Loop**
+ an easier way to load multiple files is to 1st create an empty list
+ then use a loop to iterate through each of the CSV files
+ then load each CSV file into a Pandas df
+ then append the df to the list


In [54]:
# create an empty list to append to
list_taxi_df = []

# loop through each CSV filename:
for csv_filename in nyc_taxi_data:
    # you can choose to print the filename for debugging
    # print(csv_filename)
    
    # load the CSV file into a df
    df = pd.read_csv(csv_filename)
    
    # append the df to the list that will hold the dfs
    list_taxi_df.append(df)

# print the length of the list (0 here, because glob found no files)
print(len(list_taxi_df))

# type of the first element
print(type(list_taxi_df[0]))



0


IndexError: list index out of range

**Load Multiple Files using a List comprehension**
+ Python has an idiom for looping through something and adding it to a list -> [list comprehension]


In [55]:
# the loop code without comments
list_taxi_df = []
for csv_filename in nyc_taxi_data:
    df = pd.read_csv(csv_filename)
    list_taxi_df.append(df)

# same code in a list comprehension
list_taxi_df_comp = [pd.read_csv(data) for data in nyc_taxi_data]

print(type(list_taxi_df_comp))

<class 'list'>


In [56]:
# then concatenate the results like earlier
taxi_loop_comp = pd.concat(list_taxi_df_comp)

ValueError: No objects to concatenate
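
The ValueError above comes from concatenating an empty list, since glob found no files after the failed download. A self-contained sketch of the glob + list comprehension + concat pattern, assuming two small made-up CSV files written to a temp folder (the file names and the `trip` column are hypothetical), with a guard for the empty case:

```python
import glob
import os
import tempfile

import pandas as pd

# write two small made-up CSV files so the example has something to find
tmp_dir = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({'trip': [i, i + 10]}).to_csv(
        os.path.join(tmp_dir, f'fhv_part_{i}.csv'), index=False
    )

# sorted() makes the load order predictable; glob itself gives no ordering guarantee
csv_files = sorted(glob.glob(os.path.join(tmp_dir, 'fhv_*')))

if csv_files:
    frames = [pd.read_csv(f) for f in csv_files]
    # ignore_index=True gives the combined df a fresh 0..n-1 index
    combined = pd.concat(frames, ignore_index=True)
else:
    # nothing matched the pattern, so skip the concat instead of raising
    combined = pd.DataFrame()

print(combined.shape)
```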