Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Switch from CSV to sqlite3 #89

Open
11 of 12 tasks
astrochun opened this issue May 22, 2022 · 2 comments · May be fixed by #90
Open
11 of 12 tasks

Feature: Switch from CSV to sqlite3 #89

astrochun opened this issue May 22, 2022 · 2 comments · May be fixed by #90
Assignees
Labels
CI/CD continuous integration enhancement New feature or request

Comments

@astrochun
Copy link
Owner

astrochun commented May 22, 2022

We currently generate individual tables. This could in fact be improved with using a SQL library. We could perhaps use SQLModel to support pydantic and pandas.

  • Add SQLModel models for clone, traffic and referring
  • Develop pydantic models for each data view/stats
  • Add db module
    • configure function for setting up SQLAlchemy engine
    • query function to be able to query SQLite by repository_name and date
    • migrate_csv that will work with merged and individual run CSV files (difference is with skip_rows settings).
  • Data Migration:
    • Script that can be executed within the Docker container to pull in merged data and parse that over to sqlite3.db
  • Will need to load the sqlite3 database at some point when the gts_run_all_repos script is executed
  • Be able to transform SQL data to pandas DataFrame to construct figures/charts
  • Update documentation
  • Test migration with a separate repo: astrochun/sqltest-github-stats
@astrochun astrochun added CI/CD continuous integration enhancement New feature or request labels May 22, 2022
@astrochun astrochun self-assigned this May 22, 2022
@astrochun astrochun linked a pull request Jun 4, 2022 that will close this issue
astrochun added a commit that referenced this issue Jun 24, 2022
 - Update Paths model to have optional repository_names
astrochun added a commit that referenced this issue Jul 6, 2022
 - Update Paths model to have optional repository_names
astrochun added a commit that referenced this issue Jul 6, 2022
 - Update Paths model to have optional repository_name field
 - Create data directory for sqlite db if necessary
astrochun added a commit that referenced this issue Jul 17, 2022
 - Update Paths model to have optional repository_name field
 - Create data directory for sqlite db if necessary
@astrochun
Copy link
Owner Author

The referrer tables do not have a timestamp. As such, we need to merge them and include the date the code was run (via the filename). I'm looking for my case whether there are duplicates with a simple merge. In most cases, the records are same with duplicate entries except for the following:

2021-03-17,Evolution-of-Galaxies,github.com,6,1 (first entry gave 4,1)
2021-03-17,MZEvolve,github.com,2,1 (first entry gave 1,1)
2021-03-19,MMTtools,github.com,2,1 (first entry gave 1,1)
2021-03-19,Zcalbase_gal,github.com,9,2 (first entry gave 8,2)

@astrochun
Copy link
Owner Author

astrochun commented Jul 18, 2022

One limitation right now with this feature is that with each run, there are overlap with SQLite db. We need to address this by ensure that the new numbers are used.

There was an ask for this in pandas: pandas-dev/pandas#14553; however, this was never adopted/implemented. I think we would need to do SQLModel updates directly here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD continuous integration enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant