Unable to get data #49

Closed
RichardLitt opened this issue Feb 11, 2021 · 16 comments
Comments

@RichardLitt
Contributor

I am trying to run the example commands, and there doesn't seem to be a /data folder, or the permissions for it are wrong. Am I missing something? Thank you.

Error:

➜  OSCI git:(master) python3 osci.py get-github-daily-push-events -d 2020-01-01
[2021-02-11 17:55:21,671] [INFO] ENV: None
[2021-02-11 17:55:21,671] [DEBUG] Check config file for env local exists
[2021-02-11 17:55:21,671] [DEBUG] Read config from /Users/richard/src/OSCI/__app__/config/files/local.yml
[2021-02-11 17:55:21,674] [INFO] Configuration loaded for env: local
[2021-02-11 17:55:21,674] [DEBUG] Create new <class '__app__.config.base.LocalFileSystemConfig'>
[2021-02-11 17:55:21,674] [DEBUG] Create new <class '__app__.config.base.Config'>
[2021-02-11 17:55:21,676] [DEBUG] Create new <class '__app__.datalake.datalake.DataLake'>
[2021-02-11 17:55:21,681] [INFO] Crawl events for 2020-01-01 00:00:00
[2021-02-11 17:55:21,681] [INFO] Load events for date: 2020-01-01 00:00:00
[2021-02-11 17:55:21,691] [DEBUG] Starting new HTTPS connection (1): data.gharchive.org:443
[2021-02-11 17:55:21,968] [DEBUG] https://data.gharchive.org:443 "GET /2020-01-01-0.json.gz HTTP/1.1" 200 15670114
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020/01/01'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020/01'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "osci.py", line 75, in <module>
    cli(standalone_mode=False)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/richard/src/OSCI/cli/gharchive.py", line 36, in get_github_daily_push_events
    gharchive.get_github_daily_push_events(day=day)
  File "/Users/richard/src/OSCI/__app__/crawlers/github/gharchive.py", line 34, in get_github_daily_push_events
    DataLake().landing.save_push_events_commits(push_event_commits=push_events_commits, date=day)
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 39, in save_push_events_commits
    file_path = self._get_hourly_push_events_commits_path(date)
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 73, in _get_hourly_push_events_commits_path
    return self.get_push_events_commits_parent_dir(date=date, create_if_not_exists=True) / \
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 69, in get_push_events_commits_parent_dir
    path.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  [Previous line repeated 4 more times]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [Errno 30] Read-only file system: '/data'
@abitrolly
Contributor

OMG. That's probably the most overengineered piece of Python code I've ever seen. :D I wonder if it is autogenerated?..

The path to data is actually hardcoded here.

BASE_PATH = Path(__file__).parent.parent.parent.resolve() / 'data'

In your case it should be /Users/richard/src/OSCI/__app__/data and not /data. This is the code that fails, where the path is built from BASE_PATH:

def _github_events_commits_base(self) -> Union[str, Path]:
    return self.BASE_PATH / self.BASE_AREA_DIR / 'github' / 'events' / 'push'

Unless BASE_AREA_DIR is set to some long ../../../.. pattern, this code should not fail.

There could be another explanation for this magic config override, and it is directly related to the overengineered code: it looks like somebody tried to apply the singleton pattern to the DataLake class in a rather straightforward Java way. That means whoever called DataLake() first could have configured it and set its path to /data elsewhere. Unfortunately, finding where this happens requires a debugger.

@vlad-isayko
Collaborator

@RichardLitt these paths are automatically generated based on the config file. So you need to set base_path in your local.yml to the absolute path of the directory that should hold the data.
For example:

base_path: '/data'

Change base_path: '/data' to base_path: '/Users/richard/src/OSCI/data', or to whatever path you want.
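
Before rerunning the command, it can help to confirm that the new directory can be created and written to, since the traceback above ends with a read-only file system error for /data. The following is a minimal sketch of such a check. `check_base_path` is a hypothetical helper, not part of OSCI; it only repeats the `mkdir(parents=True, exist_ok=True)` call that appears in the traceback.

```python
import os
import tempfile
from pathlib import Path


def check_base_path(base_path: str) -> Path:
    """Create base_path if needed and confirm it is writable.

    Hypothetical helper for illustration; not part of OSCI.
    """
    path = Path(base_path).expanduser()
    # Same call that fails in the traceback when the parent is read-only.
    path.mkdir(parents=True, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"{path} exists but is not writable")
    return path


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of a real checkout.
    with tempfile.TemporaryDirectory() as tmp:
        print(check_base_path(os.path.join(tmp, "OSCI", "data")))
```

If the check raises, the same path would presumably fail in local.yml as well, so it is worth picking a directory inside your home folder.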

@abitrolly this path does not come from

BASE_PATH = Path(__file__).parent.parent.parent.resolve() / 'data'

It gets the path from the config.

@abitrolly
Contributor

abitrolly commented Feb 12, 2021

It gets the path from the config.

@vlad-isayko so where is the code that does this?

@abitrolly
Contributor

@vlad-isayko after the app is installed, it will not be able to look for local.yml in checkout anymore. What is the supposed location for that file in that case?

@vlad-isayko
Collaborator

@abitrolly,
Set up the config for the local file system:

class LocalFileSystemConfig(FileSystemConfig):
    @property
    def base_path(self) -> str:
        return self.file_system_cfg.get('base_path')

    @property
    def landing_props(self) -> Dict[str, Any]:
        return dict(base_path=self.base_path, base_area_dir=self.landing_container)

    @property
    def staging_props(self) -> Dict[str, Any]:
        return dict(base_path=self.base_path, base_area_dir=self.staging_container)

    @property
    def public_props(self) -> Dict[str, Any]:
        return dict(base_path=self.base_path, base_area_dir=self.public_container)

Initiate local data lakes

@staticmethod
def __get_local_data_lakes() -> Tuple[LocalLandingArea, LocalStagingArea, LocalPublicArea]:
    return (LocalLandingArea(**Config().file_system.landing_props),
            LocalStagingArea(**Config().file_system.staging_props),
            LocalPublicArea(**Config().file_system.public_props))

The config values get passed to the constructor:

class LocalSystemArea(BaseDataLakeArea):
    BASE_PATH = Path(__file__).parent.parent.parent.resolve() / 'data'
    FS_PREFIX = 'file'
    BASE_AREA_DIR = None

    def __init__(self, base_path=BASE_PATH, base_area_dir=BASE_AREA_DIR):
        super().__init__()
        self.BASE_PATH = Path(base_path)
        self.BASE_AREA_DIR = base_area_dir
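
To make the override concrete, here is a simplified, self-contained sketch of the same pattern. `Area` and its default path are stand-ins invented for illustration, not the real OSCI classes, and `PurePosixPath` is used only so the result does not depend on the operating system. It shows that a `base_path` passed to the constructor replaces the class-level default, which is how the path in the traceback can end up as /data/landing/... when local.yml sets base_path to '/data'.

```python
from pathlib import PurePosixPath


class Area:
    """Simplified stand-in for LocalSystemArea (hypothetical, for illustration)."""

    # Class-level default, used only when no base_path is passed in.
    BASE_PATH = PurePosixPath("/checkout/__app__/data")  # hypothetical default

    def __init__(self, base_path=BASE_PATH, base_area_dir="landing"):
        # The instance attribute shadows the class-level default.
        self.BASE_PATH = PurePosixPath(base_path)
        self.BASE_AREA_DIR = base_area_dir

    def push_events_dir(self) -> PurePosixPath:
        return self.BASE_PATH / self.BASE_AREA_DIR / "github" / "events" / "push"


# Without arguments, the class-level default is used.
print(Area().push_events_dir())
# → /checkout/__app__/data/landing/github/events/push

# With the value from the config, the default is ignored.
print(Area(base_path="/data", base_area_dir="landing").push_events_dir())
# → /data/landing/github/events/push
```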

@vlad-isayko
Collaborator

@vlad-isayko after the app is installed, it will not be able to look for local.yml in checkout anymore. What is the supposed location for that file in that case?

local.yml is indeed not included in the repository, but that is intentional, since it is meant as a configuration for personal test runs. For production launches, we suggest passing secrets and configuration through environment variables.

There is a file for this: prod.yml

It describes which environment variables the values will be taken from.

The source of values is set through the value 'env':

meta:
  config_source: 'env'

So, for example, the value of container will be read from the environment variable osci_landing_container:

areas:
  landing:
    container: 'osci_landing_container'

Secrets from Databricks, which are passed through the dbutils module (a proprietary module for Spark clusters in the Databricks environment), can also act as a source of values. An example can be found in prod-cluster.yml.
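
As a rough illustration of the 'env' source described above, the sketch below resolves each leaf value of a config section by treating it as the name of an environment variable. This is not OSCI's actual loader: `resolve_from_env` and `load_areas` are hypothetical names, and the config is written as a plain dict so the example runs without a YAML parser.

```python
import os
from typing import Any, Dict, Mapping


def resolve_from_env(cfg: Mapping[str, Any],
                     env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Replace every leaf value with the environment variable it names."""
    resolved: Dict[str, Any] = {}
    for key, value in cfg.items():
        if isinstance(value, Mapping):
            resolved[key] = resolve_from_env(value, env)
        else:
            # The leaf holds a variable name; a KeyError means it is unset.
            resolved[key] = env[value]
    return resolved


def load_areas(cfg: Mapping[str, Any],
               env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Return the 'areas' section, resolved according to meta.config_source."""
    if cfg["meta"]["config_source"] == "env":
        return resolve_from_env(cfg["areas"], env)
    # Any other source: take the values literally (assumption for this sketch).
    return dict(cfg["areas"])


# Dict equivalent of the prod.yml fragments quoted above.
prod_cfg = {
    "meta": {"config_source": "env"},
    "areas": {"landing": {"container": "osci_landing_container"}},
}

if __name__ == "__main__":
    fake_env = {"osci_landing_container": "landing"}
    print(load_areas(prod_cfg, fake_env))
    # → {'landing': {'container': 'landing'}}
```

Passing the environment as a parameter keeps the sketch easy to try without exporting real variables.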

@abitrolly
Contributor

Thanks for the clarifications. The configuration code raises many questions.

For production launches, we suggest passing secrets and configuration through environment variables.

Then it would be worth documenting them at https://github.com/epam/OSCI#configuration
Why can't environment variables be used for testing as well?

@RichardLitt
Contributor Author

@abitrolly:

OMG. That's probably the most overengineered piece of Python code I've ever seen. :D I wonder if it is autogenerated?..

This is really not helpful. Please, be respectful. People have worked really hard on this code, and it does some really important work.

@vlad-isayko Thank you! Should I download data from somewhere, first? Is the data included in this repo?

@abitrolly
Contributor

@RichardLitt so how would you say that the code is overengineered and ask if it is autogenerated?

@vlad-isayko
Collaborator

@RichardLitt

It depends on what date you want to get results for.

All our reports are YTD (that is, the data is counted from the beginning of the year up to the required date; for example, for February 13, 2021, it is necessary to download and process data for all dates starting from January 1, 2021).

So for each day, you need to run several commands in sequence.

For example, for January 1, 2021:

# Load push events for 2021-01-01
python3 osci.py get-github-daily-push-events -d 2021-01-01

# Adds a company field to each commit and filters out non-company commits
python3 osci.py process-github-daily-push-events -d 2021-01-01

# Highlights repositories that had company commits that day
python3 osci.py daily-active-repositories -d 2021-01-01

# Load info from Github API about repositories that had company commits that day
python3 osci.py load-repositories -d 2021-01-01

# Removes company commits that were pushed to repositories without licenses
# We assume that having a license is a sign that a repository is open source (a criterion suggested by Red Hat: https://www.redhat.com/en/topics/open-source/what-is-open-source-software#:~:text=Open%20source%20software%20is%20released,legally%20available%20to%20end%2Dusers.)
python3 osci.py filter-unlicensed -d 2021-01-01

# Builds OSCI Ranking and OSCI Commits Ranking reports for January 1, 2021
python3 osci.py daily-osci-rankings -td 2021-01-01
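
Since the reports are YTD, the same six commands have to be repeated for every day from January 1 up to the target date. A minimal sketch of how that loop could be scripted is shown below. `ytd_commands` and `run_ytd` are hypothetical helpers, not part of OSCI; the command names and flags are copied from the list above. Whether the ranking step has to run for every day or only for the final date is not confirmed here, so the sketch simply runs all six steps each day.

```python
import subprocess
from datetime import date, timedelta
from typing import Iterator, List

# (command, date flag) pairs, in the order given above.
DAILY_STEPS = [
    ("get-github-daily-push-events", "-d"),
    ("process-github-daily-push-events", "-d"),
    ("daily-active-repositories", "-d"),
    ("load-repositories", "-d"),
    ("filter-unlicensed", "-d"),
    ("daily-osci-rankings", "-td"),
]


def ytd_commands(target: date) -> Iterator[List[str]]:
    """Yield every command line from January 1 of target's year to target."""
    day = date(target.year, 1, 1)
    while day <= target:
        for command, flag in DAILY_STEPS:
            yield ["python3", "osci.py", command, flag, day.isoformat()]
        day += timedelta(days=1)


def run_ytd(target: date) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in ytd_commands(target):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the plan instead of executing it; call run_ytd() from the
    # repository root to actually run the steps.
    for cmd in ytd_commands(date(2021, 1, 2)):
        print(" ".join(cmd))
```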

@RichardLitt
Contributor Author

Change base_path: '/data' to base_path: '/Users/richard/src/OSCI/data', or to whatever path you want.

I don't currently have the data. How do I download it? Is that what you're referring to, above?

Is there any way to get data from before 2021?

@RichardLitt
Contributor Author

@abitrolly Asking if something is overengineered and autogenerated could be seen as a value judgement, by you, of the quality of the code. Someone has worked hard at that code. Asking "Hey, I'm having trouble finding the relevant areas in the code" is much kinder, because it makes the issue about you and not about their code. I always assume that if there's something I can't understand, it's because I am missing some information - which means that we can work together to solve that problem for others. Claiming that code is confusing is putting the blame on the other party, which isn't a good way to start a conversation for the maintainer. Anyone responding will often be doing so on their own time, so it's kind to make sure that they want to help you.

@abitrolly
Contributor

@RichardLitt while I agree with you, I am biased by the view that this repository is not an open source project in the community sense, and that all the work done here is paid for by an outsourcing corporation that needs this project for marketing purposes. Treating paid developers differently from free-time maintainers doesn't make me a good person, but at least they get compensated for their time. It is kind of a poor man's rant about those who are better off in a walled garden. Sorry about that.

@RichardLitt
Contributor Author

@vlad-isayko I'm sorry that the conversation has been derailed. I appreciate you and your work.


Back to the issue at hand: I don't have any data locally. Where do I get it? Am I missing something?

@RichardLitt
Contributor Author

To clarify - I believe you tried to answer this question above, but the first command, python3 osci.py get-github-daily-push-events -d 2021-01-01, also doesn't work without a /data folder.

@RichardLitt
Contributor Author

Reread the docs. It's pretty clear I need to download data first from the GH Archive. I think that's what I was missing. Thanks, Vlad.
